1  Introduction


The primary goal of the MINC project was to complete the fundamental research required to develop an
inference methodology that would, for the first time, allow end-systems and other network elements to
determine the performance characteristics (such as loss, delay, and available bandwidth) of both individual
links and end-to-end paths within an internetwork. The goal was a methodology that did not require the active
participation of the links and routers being characterized. The MINC project also had as a goal to develop
measurement tools that would run on NIMI (the National Internet Measurement Infrastructure), and to
develop analysis tools to apply to the measurements.

The main outcomes of the project were the following:

• Multicast-based network inference techniques: We developed statistically rigorous estimation
techniques to determine link loss and delay, and to identify the multicast tree topology. The techniques use
the set of receiver traces of end-to-end path performance in a multicast tree and exploit the correlation
that multicast traffic inherently contains to derive link-level performance metrics.

• Unicast-based network inference techniques: We also developed similar techniques, which rely on
unicast measurements.

• Measurement layout techniques: We developed efficient algorithms for selecting and laying out
low cost multicast distribution trees for the purpose of inferring the behavior of a target set of network
links.

• Measurement, analysis, and visualization tools: We developed end-to-end measurement tools that
run on NIMI and a validated, web-based tool, MINT (Multicast Inference Network Tool), that allows
an analyst to analyze and visualize internal network behavior.

We describe each of these outcomes in the remainder of this report. Prior to this, we motivate the need
for our project.

2  Motivation 

As the Internet grows in size and diversity, its internal performance becomes ever more difficult to measure.
Any one organization has administrative access to only a small fraction of the network’s internal nodes,
whereas commercial factors often prevent organizations from sharing internal performance data. End-to-end
measurements using unicast traffic do not rely on administrative access privileges, but it is difficult to
infer link-level performance from them and they require large amounts of traffic to cover multiple paths.
There is, consequently, a need for practical and efficient procedures that can take an internal snapshot of a
significant portion of the network.

Figure 1: A tree connecting a sender to two receivers.

The  focus  of  the  MINC  project  has  been  the  development  of  measurement  techniques  that  address  these 
problems. MINC (Multicast Inference of Network Characteristics) uses end-to-end multicast measurements
to infer link-level loss rates and delay statistics by exploiting the inherent correlation in performance
observed by multicast receivers. These measurements do not rely on administrative access to internal nodes
since  they  are  done  between  end  hosts.  In  addition,  they  scale  to  large  networks  because  of  the  bandwidth 
efficiency  of  multicast  traffic. 

Focusing  on  loss  for  the  moment,  the  intuition  behind  packet  loss  inference  is  that  the  event  of  the 
arrival  of  a  packet  to  a  given  internal  node  in  the  tree  can  be  inferred  from  the  packet’s  arrival  at  one  or  more 
receivers  descended  from  that  node.  Conditioning  on  this  latter  event,  we  can  determine  the  probability  of 
successful transmission to and beyond the given node. Consider, for example, a simple multicast
tree (Figure 1) with a root node (the source), two leaf nodes (receivers R1 and R2), a link from the source
to a branch point (the shared link), and a link from the branch point to each of the receivers (the left and
right links). The source sends a stream of sequenced multicast packets through the tree to the two receivers.
If a packet reaches either receiver, we can infer that the packet reached the branch point. Thus the ratio of
the number of packets that reach both receivers to the total number that reach the right receiver (whether or
not they also reach the left one) gives an estimate of the probability of successful transmission on the left
link. The probability of successful transmission on the other links can be found by similar reasoning.
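
The ratio argument above can be sketched in a few lines of Python. This is only an illustration for the two-receiver tree of Figure 1, assuming losses are independent across links; the function name and the input format (one pair of booleans per probe) are hypothetical and not taken from the project's tools.

```python
def estimate_two_receiver_tree(outcomes):
    """Estimate link success probabilities on the two-receiver tree.

    `outcomes` holds one (got_r1, got_r2) pair of booleans per probe
    (hypothetical input format). R1 hangs off the left link, R2 off the
    right link. Returns (shared, left, right) success-probability estimates.
    """
    n = len(outcomes)
    n_left = sum(1 for r1, _ in outcomes if r1)      # probes seen by R1
    n_right = sum(1 for _, r2 in outcomes if r2)     # probes seen by R2
    n_both = sum(1 for r1, r2 in outcomes if r1 and r2)
    if n == 0 or n_both == 0:
        raise ValueError("need at least one probe seen by both receivers")
    left = n_both / n_right            # P(R1 got it | R2 got it)
    right = n_both / n_left            # P(R2 got it | R1 got it)
    shared = (n_left * n_right) / (n_both * n)
    return shared, left, right


# Synthetic counts consistent with shared=0.8, left=0.5, right=0.75.
probes = ([(True, True)] * 300 + [(True, False)] * 100 +
          [(False, True)] * 300 + [(False, False)] * 300)
shared, left, right = estimate_two_receiver_tree(probes)
print(shared, left, right)
```

The shared-link estimate follows from the same reasoning: under independence, the fraction of probes seen by R1 times the fraction seen by R2, divided by the fraction seen by both, leaves only the shared-link success probability.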

The remainder of this report is organized as follows. We describe the MINC multicast methodology
in Section 3, its adaptation to the use of unicast in Section 4, and the tree layout methodology in Section 5.
Following  this,  we  describe  the  measurement  methodology  and  the  analysis  tool,  MINT  (Section  7).  Section 
8  offers  some  conclusions. 

3  Statistical  Methodology 

MINC  works  on  logical  multicast  trees,  i.e.  those  whose  nodes  are  identified  as  branch  points  of  the  physical 
multicast  tree.  A  single  logical  link  between  nodes  of  the  logical  multicast  tree  may  comprise  more  than  one 
physical  link.  MINC  infers  composite  properties  of  the  logical  links.  Henceforth  when  we  speak  of  trees  we 
will  be  speaking  of  logical  multicast  trees. 
3.1  Loss Inference


We model packet loss as independent across different links of the tree, and independent between different
probes. Thus, the loss model associates with each link k in the tree the probability $\alpha_k$ that a packet reaches
the terminating node of the link, also denoted by k, given that it reaches the parent node of k. The link
loss probability is, then, $1 - \alpha_k$. Each receiver records the outcome of each probe sent by the source,
i.e., whether it is received or not. The $\alpha_k$ can be expressed directly as a function of the probabilities of all
possible outcomes of success and loss of a probe at each receiver. An experiment consists of a series of
probes transmitted from the source. The outcome of each probe at each receiver is recorded and the link
probabilities are inferred by the estimators $\hat{\alpha}_k$ obtained by using the actual frequencies of the outcomes. [4]
contains a detailed description and analysis of the inference algorithm.

The estimators $\hat{\alpha}_k$ exhibit several desirable statistical properties. It was shown in [4] that $\hat{\alpha}_k$ is the
Maximum Likelihood Estimator (MLE) of $\alpha_k$ when sufficiently many probes are used. The MLE is defined
as the set of link probabilities that maximizes the probability of obtaining the observed outcome frequencies.
The MLE property in turn implies two further properties of $\hat{\alpha}_k$, namely

1. consistency: $\hat{\alpha}_k$ converges to the true value $\alpha_k$ almost surely as the number of probes n grows to
infinity, and

2. asymptotic normality: the distribution of the quantity $\sqrt{n}(\hat{\alpha}_k - \alpha_k)$ converges to a normal distribution
as n grows to infinity.

The  latter  property  implies  that  the  probability  of  an  error  of  a  given  size  in  estimating  a  link  probability 
goes  to  zero  exponentially  fast  in  the  number  of  probes. 

The computation of the $\hat{\alpha}_k$ is performed recursively on the tree; the computational cost is linear in the
number of probes and number of nodes in the tree. Details of this algorithm are given in [4], which is
included in the Appendix.
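
As a rough illustration of the recursive computation, the sketch below handles binary logical trees only. It assumes the usual formulation in which, for a node k with children j1 and j2, the probability A_k that a probe reaches k satisfies A_k = g(j1) g(j2) / (g(j1) + g(j2) - g(k)), where g(x) is the observed fraction of probes received by at least one receiver below x. The function name, the tree encoding, and the probe format are hypothetical; [4] remains the reference for the actual algorithm and its general (non-binary) form.

```python
def infer_link_success(children, root, probes):
    """Estimate per-link success probabilities on a binary logical tree.

    children: dict node -> list of child nodes (leaves may be omitted).
    probes:   list of sets; each set holds the leaf receivers that got the probe.
    Returns a dict mapping each non-root node k to the estimate for the
    link that terminates at k. Degenerate data (zero denominators) is not
    handled in this sketch.
    """
    n = len(probes)
    leaves, reach, alpha = {}, {}, {}

    def collect(k):
        # Record the set of leaf receivers descended from k.
        kids = children.get(k, [])
        leaves[k] = {k} if not kids else set().union(*(collect(j) for j in kids))
        return leaves[k]

    collect(root)

    def gamma(k):
        # Fraction of probes seen by at least one receiver below k.
        return sum(1 for got in probes if got & leaves[k]) / n

    def solve(k):
        kids = children.get(k, [])
        if k == root:
            reach[k] = 1.0                      # every probe leaves the source
        elif not kids:
            reach[k] = gamma(k)                 # a leaf is reached iff it receives
        elif len(kids) == 2:
            g1, g2 = gamma(kids[0]), gamma(kids[1])
            reach[k] = g1 * g2 / (g1 + g2 - gamma(k))
        else:
            raise ValueError("this sketch handles binary branch points only")
        for j in kids:
            solve(j)
            alpha[j] = reach[j] / reach[k]      # conditional success on link to j

    solve(root)
    return alpha


# Source 0 -> branch 1 -> {leaf 2, branch 3 -> {leaf 4, leaf 5}}.
children = {0: [1], 1: [2, 3], 3: [4, 5]}
probes = ([{2, 4, 5}] * 5 + [{2, 4}] * 5 + [{2, 5}] * 5 + [{2}] * 25 +
          [{4, 5}] * 5 + [{4}] * 5 + [{5}] * 5 + [set()] * 45)
alpha = infer_link_success(children, 0, probes)
print(alpha)
```

The synthetic probe counts are the exact expected frequencies for a success probability of 0.8 on the link from the source and 0.5 on every other link, so the estimates recover those values.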

Often we are faced with the problem of inferring the link loss behavior for a collection of multicast trees.
We have developed two approaches for handling this problem. The first is to treat each tree separately and
obtain the MLE for each segment contained within each tree. If a segment is found in two or more trees, then
we return the minimum variance weighted average (MVWA), an estimate equal to the weighted average of
the MLEs obtained from the different trees. Here the weights are taken to be inversely proportional to the
variances of the link estimates, which is the choice that minimizes the variance of the combined estimate.
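
A minimal sketch of such a weighted average follows. It assumes the standard result that, for independent unbiased estimates, the variance of the combination is minimized by weights inversely proportional to the individual variances. The function name and the way per-tree variances are supplied are hypothetical; [3] describes how the project actually obtains them.

```python
def minimum_variance_weighted_average(estimates, variances):
    """Combine per-tree estimates of one link's success probability.

    estimates: per-tree MLEs for the same segment.
    variances: estimated variance of each MLE (assumed known and positive).
    Returns (combined_estimate, variance_of_combined_estimate).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    weights = [1.0 / v for v in variances]      # inverse-variance weights
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total


# Two trees share a segment; the first tree's estimate is three times
# more precise, so it receives three times the weight.
combined, combined_var = minimum_variance_weighted_average([0.9, 0.8], [0.01, 0.03])
print(combined, combined_var)
```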

The second approach consists of applying the expectation maximization (EM) algorithm [16] to the
entire collection of trees. Briefly, the EM algorithm calculates the maximum likelihood estimate of the link
probabilities in an iterative fashion. Our experience with both the MVWA and EM algorithms suggests that
much of the time there is little difference in the quality of the estimates that they return. However, when there
are either few observations or the numbers of observations vary widely from tree to tree, the EM algorithm
produces a more accurate estimate. Details of these two approaches can be found in [3], which is included
in the Appendix.

Finally,  situations  arise  where  observations  may  be  missing  from  different  subsets  of  receivers  in  a 
tree.  Such  a  situation  is  easily  handled  using  the  EM  algorithm  to  fill  in  the  missing  observations  with 
estimates  based  on  the  observations  that  are  available.  We  have  observed  from  simulation  studies  that  the 
quality  of  the  estimates  produced  when  observations  are  missing  is  close  to  that  for  the  case  of  complete 
observations unless the percentage of missing observations exceeds 70% to 80%. Details of the algorithm
and its evaluation can be found in [12], also included in the Appendix.

3.2  Delay  Distribution  Inference 

A  generalization  of  the  loss  inference  methodology  allows  one  to  infer  per  link  delay  distributions.  More 
precisely,  we  infer  the  distribution  of  the  variable  portion  of  the  packet  delay,  what  remains  once  the  link 
propagation  delay  and  packet  transmission  time  are  removed.  Packet  link  delays  are  modeled  as  discrete 
random  variables  that  can  take  one  of  a  finite  number  of  values,  independent  between  different  packets  and 
links. The model is specified by a finite set of probabilities $\alpha_k(t)$ that a packet experiences delay t while
traversing the link terminating at node k, with infinite delay interpreted as loss.

When a probe is transmitted from the source, we record either the time taken by the probe to reach each
receiver or that the probe was lost. As with the loss inference, a probabilistic analysis enables us to relate
the $\alpha_k(t)$ to the probabilities of the outcomes at the receivers. We infer the link delay probabilities by the
estimators $\hat{\alpha}_k(t)$ obtained by using instead the actual frequencies of the outcomes arising from the dispatch
of a number of probes. In [15], it was shown that the corresponding estimator $\hat{\alpha}(\cdot)$ of the link delay
distributions is strongly consistent and asymptotically normal. [15] is included in the Appendix.

3.3  Delay  Variance  Inference 

The delay variance can be estimated directly. Consider the binary topology of Figure 1. Let $D_0$ be the packet
delay on the link emanating from the source, and $D_i$, i = 1, 2, the delay on the link terminating at receiver
i. The end-to-end delay from the source to leaf node i = 1, 2 is expressed as $X_i = D_0 + D_i$. A short
calculation shows that, with the assumption that the $D_i$ are independent, $\mathrm{Var}(D_0) = \mathrm{Cov}(X_1, X_2)$. Thus
the variance of the delay $D_0$ can be estimated from the measured end-to-end delays from the source to the
leaves. This approach has been generalized to estimate link delay variances in arbitrary trees. Details can
be found in [13], which is contained within the Appendix.
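
The identity follows because, with the shared-link delay independent of the two receiver-link delays and those independent of each other, every cross term in the covariance of the two end-to-end delays vanishes except the variance of the shared-link delay. The sketch below estimates that variance with an ordinary sample covariance; the function name and input format are hypothetical, and it assumes both receivers recorded a delay for every probe in the list (lost probes removed beforehand).

```python
def shared_link_delay_variance(x1, x2):
    """Estimate the shared-link delay variance from paired end-to-end delays.

    x1[i] and x2[i] are the end-to-end delays of probe i at receivers 1 and 2.
    Returns the unbiased sample covariance of the two series.
    """
    n = len(x1)
    if n != len(x2) or n < 2:
        raise ValueError("need at least two paired delay samples")
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    return sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)


# Synthetic delays built as X_i = D_0 + D_i with
# D_0 = [0, 1, 2, 3], D_1 = [2, 0, 0, 2], D_2 = [4, 0, 6, 2],
# chosen so that the sample cross-covariances between the D's are zero.
x1 = [2, 1, 2, 5]
x2 = [4, 1, 8, 5]
var_d0 = shared_link_delay_variance(x1, x2)
print(var_d0)  # equals the sample variance of D_0, i.e. 5/3
```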

3.4  Topology  Inference 

In the loss inference methodology described above, the logical multicast tree was assumed to be known
in advance. However, extensions of the method enable inference of an unknown multicast topology from
end-to-end measurements. We describe briefly three approaches.

The first, which relies on loss-based grouping, was suggested in [21], in the context of grouping multicast
receivers that share the same set of network bottlenecks from the source. The loss estimator of Section 3.1
estimates the shared loss to a pair of receivers, i.e., the composite loss rate on the common portion of the
paths from the source, irrespective of the underlying topology. Since this loss rate is larger the longer the
common path in question, the actual shared loss rate is maximized when the two receivers are siblings.

A binary tree can be reconstructed iteratively using this approach. Starting with the set of receiver nodes
R, select the pair of nodes j, k in R that maximizes the estimated shared loss, and group them together as a
composite node. Iterate on this composite node and the remaining nodes of R until all are grouped. The
algorithm is consistent: the probability of correct identification converges to one as the number of probes
grows; see [11]. General (i.e., non-binary) trees can be inferred by using this algorithm and then transforming
the resulting binary tree by pruning links with inferred loss rates less than some threshold $\epsilon$. We refer to
this as the binary loss tree with pruning (BLTP) algorithm.
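
The grouping step can be sketched as follows. This illustration covers only the binary reconstruction and omits the pruning step; it assumes that a composite node counts as having seen a probe when any receiver beneath it did, and that the shared success probability of a pair is estimated as (fraction seen by a) x (fraction seen by b) / (fraction seen by both). The function name and trace format are hypothetical; see [11] for the actual algorithm.

```python
from itertools import combinations


def infer_binary_topology(traces):
    """Group receivers bottom-up by maximal estimated shared loss.

    traces: dict receiver -> list of booleans, one per probe (True = received).
    Returns the inferred tree as nested tuples, e.g. (2, (4, 5)).
    """
    nodes = dict(traces)        # node label -> per-probe reception indicator

    def shared_success(a, b):
        n = len(nodes[a])
        na, nb = sum(nodes[a]), sum(nodes[b])
        nab = sum(1 for x, y in zip(nodes[a], nodes[b]) if x and y)
        # No jointly received probe: no evidence of a shared path, so rank last.
        return na * nb / (nab * n) if nab else float("inf")

    while len(nodes) > 1:
        # Maximal shared loss is the same as minimal shared success probability.
        j, k = min(combinations(nodes, 2), key=lambda pair: shared_success(*pair))
        merged = [x or y for x, y in zip(nodes[j], nodes[k])]
        del nodes[j], nodes[k]
        nodes[(j, k)] = merged  # composite node replaces the pair
    return next(iter(nodes))


# Receivers 4 and 5 are siblings below an extra lossy link; receiver 2 is not.
received = ([{2, 4, 5}] * 5 + [{2, 4}] * 5 + [{2, 5}] * 5 + [{2}] * 25 +
            [{4, 5}] * 5 + [{4}] * 5 + [{5}] * 5 + [set()] * 45)
traces = {r: [r in got for got in received] for r in (2, 4, 5)}
tree = infer_binary_topology(traces)
print(tree)
```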

A second, Bayesian approach assumes that the topology comes from a known prior distribution. It then
identifies the topology that minimizes a posterior risk function associated with topologies. This approach
requires the fewest observations in order to classify the topology, provided that the topology comes
from the known prior distribution.

The third, a direct ML approach, calculates the maximum likelihood of the measured outcomes over
all possible $\alpha_k$. The topology that maximizes this quantity is chosen to be our estimate. This classifier is
consistent [11].

Our experience with these three approaches is that the Bayesian algorithm is slightly more accurate than
the BLTP algorithm, but at the cost of significant computational complexity. The direct ML approach does
not do as well as either of the other algorithms. Details of these algorithms and their evaluation can be found
in [11], which is included in the Appendix.

The loss-based grouping approach can be extended by replacing shared loss with any function
on the nodes that (i) increases on moving further from the source; and (ii) whose value at a given node can be
consistently estimated from measurements at receivers descended from that node. The mean and variance of
the cumulative delay from the source to a given node exhibit these properties. Hence, multicast end-to-end
delay measurements can also be used to infer the multicast topology. Details of this approach can be found
in [9], which is contained in the Appendix.

Lastly, we have combined the use of losses and delays to develop a hybrid grouping algorithm that
combines the best of BLTP and a delay-based algorithm. The basic idea is as follows. At each step where two
nodes are to be combined, candidates are selected based on losses and on delays. The probabilities of each
of these being in error are estimated and the candidates that yield the lowest error are chosen. In other words,
the selection is determined by losses if this leads to a lower probability of error and by delays otherwise.
This approach is described in [8], which is contained in the Appendix.


4  Unicast-based  Methods 

At the start of this project we had every expectation that multicast would become pervasive throughout the
Internet.  Unfortunately,  its  deployment  has  proceeded  very  slowly.  Consequently,  we  turned  our  attention 
to  adapting  our  techniques  for  use  with  end-to-end  unicast  measurements. 

In  this  section,  we  describe  our  work  to  adapt  the  multicast  inference  techniques  described  in  Section 
3  to  perform  inference  of  internal  network  characteristics  from  unicast  end-to-end  measurements.  The  data 
for  the  inference  comprises  measured  end-to-end  loss  of  unicast  probes  sent  from  a  source  to  a  number  of 
destinations.  This  is  used  to  infer  the  loss  and  delay  characteristics  of  each  logical  link  of  the  source  tree 
joining  the  source  to  the  destinations,  i.e.,  of  the  composite  paths  between  its  branch  points. 

The  idea  is  to  construct  composite  probes  of  unicast  packets  whose  collective  statistical  properties 
closely  resemble  those  of  a  multicast  packet.  We  shall  speak  of  striping  a  group  of  unicast  packets  across  a 
set  of  destinations.  This  entails  dispatching  the  packets  back-to-back  from  a  source,  each  packet  potentially 
having  a  different  destination  address.  Our  premise  is  that  when  the  duration  of  network  congestion  events 
exceeds  the  temporal  width  of  the  stripe,  packets  should  have  very  similar  experience  of  the  network  upon 
traversing  common  portions  of  the  paths  to  their  destinations.  If  the  experiences  were  identical,  the  packets 
from  a  stripe  that  attempt  to  traverse  a  given  link  would  either  all  be  lost,  or  encounter  identical  delay.  Hence 
the  packet  loss  and  delays  on  a  given  link  would  be  perfectly  correlated  within  a  stripe;  the  composite  probe 
would  have  the  same  statistical  properties  as  a  notional  multicast  packet  that  followed  the  same  source  tree. 
In  this  case  the  methods  presented  in  Section  3  could  be  applied  immediately  to  infer  the  per  link  loss  and 
delay statistics of the logical links of the source tree.

However,  correlations  within  stripes  may  be  less  than  perfect  in  practice.  This  is  because  congestion 
events  may  not  affect  packets  uniformly,  subjecting  stripes  to  dispersion  as  they  travel  through  a  network. 
Some  mechanisms  by  which  this  can  happen  are  the  following.  Packet  loss  will  not  be  uniform  during  loss 
events that are narrower than the stripe, or those that start or stop while the stripe is in progress.
Furthermore, delays will vary due to interleaving of background traffic, e.g., when moving from a low to a high
capacity link. Although such effects should be small for sufficiently narrow stripes, they will be cumulative.
Packet-dropping on the basis of Random Early Detection (RED) [17] is another mechanism by which packet
loss  may  become  decorrelated.  It  remains  to  be  seen  whether  this  mechanism  will  be  widely  deployed 
in  communications  networks.  On  the  other  hand,  the  use  of  RED  to  merely  mark  packets  will  not  break 
correlations. 

This  motivated  the  following  four  investigations: 

(i)  determining  the  magnitude  of  imperfect  correlations  through  experiments  on  real  networks; 
(ii) calculating their likely impact on the accuracy of inference via methods that assume perfect correlations;

(iii)  adopting  measurement  procedures  that  reduce  the  impact  of  imperfect  correlations; 

(iv)  verifying  the  accuracy  of  the  approach  in  simulations. 

We  extended  the  packet  loss  model  of  Section  3  by  incorporating  an  additional  parameter  for  each  link 
that  describes  the  correlation  of  loss  between  different  packets  of  the  same  stripe.  This  is  done  for  binary 
stripes,  i.e.,  those  comprising  two  packets  with  different  destination  addresses.  These  additional  parameters 
cannot  themselves  be  determined  by  end-to-end  measurements,  at  least  not  without  additional  assumptions 
relating  them  to  each  other,  or  to  the  existing  loss  rate  parameters. 

By constructing appropriate stripes of composite probes and selecting subsets of these probes for
inference, we are able to enhance correlations within data used for inference. This is possible when packet
transmissions are correlated in the sense that a given packet in a stripe is more likely to be transmitted
when other packets of the stripe are known to have been transmitted. By conditioning on the measurable event
that nearby packets have been transmitted, we raise the likelihood of transmission of a given packet closer to 1.
By  sending  the  striped  packets  to  diverse  addresses,  we  can  infer  the  properties  of  internal  network  paths 
from  the  measurements. 

We evaluated the proposed methods through measurements over the Internet and through simulation.
We collected end-to-end measurements on the National Internet Measurement Infrastructure (NIMI)
[19] from a diverse set of Internet paths. We transmitted stripes between pairs of end-hosts and verified
that their packet loss statistics were consistent with the correlation assumptions that underlie the method.
(These stripes were different from those defined above, since all packets in the stripes were sent to the same
destination.) We also estimated the likely accuracy that would be obtained by stripe-based inference in the
network.

We supported the measurement work through network level simulation with ns [20]. By instrumenting
the simulation we traced the origin of end-to-end behavior to network internal characteristics. This allowed
us first to exhibit the correlation properties of packets within stripes as they are transmitted across individual
links in the network (rather than just the end-to-end properties), and second to compare the inferred link loss
rates with actual link loss rates. For the most accurate choice of striping method we find the typical absolute
error in loss rate inference to be below 1%.

Details of this extension to unicast are found in [14], also contained within the Appendix.

5  Tree  Layout 

In the previous sections, we have described algorithms for inferring internal network behavior. In this
section we describe algorithms developed during the project for choosing the trees over which to perform
measurements.

A network is represented by a directed graph $N = (V, E)$, where V and E denote the sets of nodes and
links within N, respectively. Our interest is in multicast trees embedded within N. Let $S \subseteq V$ be a set of
possible multicast senders, and $R \subseteq V$ be a set of possible multicast receivers. Let $T = (V_T, E_T)$ denote a
directed (multicast) tree with a source $s_T$ and a set of leaves $r_T$. We require that $s_T \in S$ and $r_T \subseteq R$. Let
A be a routing algorithm that produces the distribution tree $A(s, r)$ between a source $s \in S$ and receiver set
$r \subseteq R$. Let $\mathcal{T}(A, S, R) = \{A(s, r) : s \in S, r \subseteq R\}$, i.e., $\mathcal{T}(S, R)$ is the set of all possible multicast
distribution trees within the network N with sources from S and receivers from R under routing algorithm A.

Associated with a multicast tree $T \in \mathcal{T}(S, R)$ is a cost

$$C(T) = C_T^{0} + \sum_{i \in E(T)} C_i \qquad (1)$$

where  the  first  term  can  be  thought  of  as  a  “per  tree  cost”  and  the  second  is  a  “per  link  cost”.  For  example, 
IP  multicast  requires  each  multicast  router  to  maintain  per  flow  state.  This  is  accounted  for  by  the  per  tree 
cost.  The  per  link  cost  is  the  cost  for  sending  probe  packets  through  a  link.  The  two  problems  of  interest  to 
us  are  as  follows: 

Multicast Tree Identifiability Problem (MTIP). Given a set of multicast trees $\mathcal{T}' \subseteq \mathcal{T}(S, R)$ and a set of
links $L \subseteq E$, is L identifiable by the set of trees $\mathcal{T}'$?

Minimum cost Multicast Tree Cover Problem (MMTCP). Given $S, R \subseteq V$ and $L \subseteq E$ such that L is identifiable
by $\mathcal{T}(S, R)$, what is the minimum cost subset of $\mathcal{T}(S, R)$ sufficient to cover L? In other words, find
$\mathcal{T}' \subseteq \mathcal{T}(S, R)$ that covers L and minimizes

$$C(\mathcal{T}') = \sum_{T \in \mathcal{T}'} C(T)$$
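
To make the cost structure concrete, the sketch below evaluates the per-tree plus per-link cost and runs the textbook greedy heuristic for weighted set cover. It is not the project's heuristic (which combines trees containing few receivers) and it makes a strong simplification: a link is treated as covered as soon as it lies on a chosen tree, whereas identifiability in the sense of MTIP is a stricter requirement. All names and the data layout are hypothetical.

```python
def tree_cost(links, per_tree_cost, link_cost):
    """Cost of one tree: a fixed per-tree cost plus the cost of each of its links."""
    return per_tree_cost + sum(link_cost[link] for link in links)


def greedy_tree_cover(trees, target_links, per_tree_cost, link_cost):
    """Pick trees greedily by cost per newly covered target link.

    trees: dict tree name -> set of links on that tree.
    Returns (list of chosen tree names, total cost).
    """
    uncovered = set(target_links)
    chosen, total = [], 0.0
    while uncovered:
        candidates = [name for name in sorted(trees)
                      if name not in chosen and trees[name] & uncovered]
        if not candidates:
            raise ValueError("target links cannot be covered by the given trees")
        best = min(candidates,
                   key=lambda name: tree_cost(trees[name], per_tree_cost, link_cost)
                   / len(trees[name] & uncovered))
        chosen.append(best)
        total += tree_cost(trees[best], per_tree_cost, link_cost)
        uncovered -= trees[best]
    return chosen, total


# Three candidate trees over links a..d, unit link costs, unit per-tree cost.
trees = {"T1": {"a", "b"}, "T2": {"b", "c"}, "T3": {"a", "b", "c", "d"}}
link_cost = {"a": 1, "b": 1, "c": 1, "d": 1}
chosen, total = greedy_tree_cover(trees, {"a", "c"}, 1, link_cost)
print(chosen, total)
```

Here the single larger tree T3 (cost 5) is preferred over the pair T1 and T2 (cost 6), which shows how the per-tree cost term discourages using many small trees.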

We  have  developed  the  following  during  the  project. 

•  We  developed  efficient  algorithms  for  solving  the  identifiability  problems. 

•  We  established  that  the  cover  problem  is  NP-hard  and  that  in  some  cases,  finding  an  approximation 
within  a  certain  factor  of  optimal  is  also  NP-hard.  Thus,  we  also  propose  several  heuristics  and  show 
through  simulation  that  a  greedy  heuristic  that  iteratively  combines  trees  containing  a  small  number 
of  receivers  performs  reasonably  well. 

• We developed polynomial time algorithms that find optimal solutions for a restricted class of network
topologies, including trees. These algorithms can be used to provide a heuristic for sparse, tree-like
networks. This heuristic is also shown through simulation to perform well.

• We applied our techniques to the vBNS and Abilene networks as well as randomly generated networks,
showing the effectiveness of the different heuristics.


Details of this work can be found in [2], which is included in the Appendix.


6  Experimental  Results 


In this section we briefly describe our efforts to validate the MINC methodology. Section 6.1 contains a
description of the results of a measurement study in which we collected end-to-end loss traces from the
MBone and validated the inferred loss rates against loss rates collected using the Internet tool mtrace.
Section  6.2  contains  a  description  of  the  results  from  more  detailed  simulation  studies  of  both  loss  and 
delay. 


6.1  Measurement  Experiments 

To  validate  MINC  under  real  network  conditions,  we  performed  a  number  of  measurement  experiments  on 
the  MBone,  the  multicast-capable  subset  of  the  Internet.  Across  our  experiments  we  varied  the  multicast 
sources  and  receivers,  the  time  of  day,  and  the  day  of  the  week.  We  compared  inferred  loss  rates  to  directly 
measured  loss  rates  for  all  links  in  the  resulting  multicast  trees.  The  two  sets  of  quantities  agreed  closely 
throughout. 

During each experiment, a source sent a stream of 40-byte, sequenced packets, one every 100 milliseconds,
to  a  multicast  group  consisting  of  a  collection  of  receivers  over  the  course  of  one  hour.  The  resulting  traffic 
stream  placed  less  than  4  Kbps  of  load  on  any  one  MBone  link.  At  each  receiver,  we  made  two  sets  of 
measurements  on  this  traffic  stream  using  the  mtrace  and  mbat  software  tools. 
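
The probe load and probe counts quoted in this section can be checked with a short calculation. The sketch assumes the 40 bytes is the full packet size counted toward link load, which the text does not state explicitly; the constant names are illustrative only.

```python
PACKET_BYTES = 40          # probe size stated in the text
INTERVAL_MS = 100          # one probe every 100 milliseconds

probes_per_second = 1000 // INTERVAL_MS
load_bps = PACKET_BYTES * 8 * probes_per_second      # bits per second
probes_per_two_minutes = (2 * 60 * 1000) // INTERVAL_MS
probes_per_hour = (60 * 60 * 1000) // INTERVAL_MS

print(load_bps, probes_per_two_minutes, probes_per_hour)
```

Under this assumption the stream amounts to 3,200 bits per second, consistent with the stated bound of less than 4 Kbps, and each two-minute interval contains 1,200 probes.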

We used mtrace to determine the topology of the multicast tree. mtrace traces the reverse path from
a  multicast  source  to  a  receiver.  It  runs  at  the  receiver  and  issues  trace  queries  that  travel  hop  by  hop  along 
the  multicast  tree  towards  the  source.  Each  router  along  the  path  responds  to  these  queries  with  its  own  IP 
address.  We  determined  the  tree  topology  by  combining  this  path  information  for  all  receivers. 

We  also  used  mtrace  to  measure  per-link  packet  losses.  Routers  also  respond  to  mtrace  queries  with 
a count of how many packets they have seen directed to the specified multicast group. mtrace calculates
packet  losses  on  a  link  by  comparing  the  packet  counts  returned  by  the  two  routers  at  either  end  of  the  link. 
We  ran  mtrace  every  two  minutes  during  each  one-hour  experiment.  These  mtrace  queries  were  also 
used  to  verify  that  the  topology  remained  constant  during  each  experiment. 

It  is  important  to  note  that  mtrace  does  not  scale  to  measurements  of  large  multicast  groups  if  used 
in  parallel  from  all  receivers  as  we  describe  here.  Parallel  mtrace  queries  converge  as  they  travel  up  the 
tree.  Enough  such  queries  will  overload  routers  and  links  with  measurement  traffic.  We  used  mtrace  in 
this  way  only  to  validate  MINC  on  relatively  small  multicast  groups. 

We used mbat to collect traces of end-to-end packet losses. mbat runs at a receiver, subscribes to a
specified  multicast  group,  and  records  the  sequence  number  and  arrival  time  of  each  incoming  packet.  We 
ran  mbat  at  each  receiver  for  the  duration  of  each  hour-long  experiment. 

Figure 2: Multicast routing tree during our representative MBone experiment.

We then segmented the mbat traces into two-minute subtraces corresponding to the two-minute intervals
on which we collected mtrace measurements. Finally, we ran our loss inference algorithm on each
two-minute interval and compared the inferred loss rates with the directly measured loss rates.
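
The segmentation step might look like the sketch below. The actual mbat trace format is not described here, so the input layout (a set of received sequence numbers per receiver) is an assumption, as is the choice to assign probes to intervals by sequence number, which lets lost probes be attributed to the interval in which they were sent.

```python
def segment_traces(traces, total_probes, probes_per_interval=1200):
    """Split per-receiver traces into fixed-size measurement intervals.

    traces: dict receiver -> iterable of received sequence numbers (0-based).
    Returns a list of intervals; each interval is a list with one set per
    probe, holding the receivers that recorded that probe.
    """
    intervals = []
    for start in range(0, total_probes, probes_per_interval):
        stop = min(start + probes_per_interval, total_probes)
        intervals.append([set() for _ in range(start, stop)])
    for receiver, seqs in traces.items():
        for seq in seqs:
            if 0 <= seq < total_probes:
                interval = intervals[seq // probes_per_interval]
                interval[seq % probes_per_interval].add(receiver)
    return intervals


# Tiny example: four probes, two per interval.
example = segment_traces({"r1": [0, 1, 3], "r2": [1, 2]},
                         total_probes=4, probes_per_interval=2)
print(example)
```

Each interval can then be passed separately to a loss inference routine, giving one inferred loss rate per link per interval to compare against the corresponding mtrace measurement.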

Here  we  highlight  results  from  a  representative  experiment  on  August  26,  1998.  Figure  2  shows  the 
multicast  routing  tree  in  effect  during  the  experiment.  The  source  was  at  the  U.  of  Kentucky  and  the  receivers 
were  at  AT&T  Labs,  U.  of  Massachusetts,  Carnegie  Mellon  U.,  Georgia  Tech,  U.  of  Southern  California, 
U.  of  California  at  Berkeley,  and  U.  of  Washington.  The  four  branch  routers  were  in  California,  Georgia, 
Massachusetts,  and  New  Jersey. 

Figure  3  shows  that  inferred  and  directly  measured  loss  rates  agreed  closely  despite  a  link  experiencing 
a  wide  range  of  loss  rates  over  the  course  of  a  one-hour  experiment.  Each  short  horizontal  segment  in 
the  graph  represents  one  two-minute,  1,200-probe  measurement  interval.  As  shown,  loss  rates  on  the  link 
between  the  U.  of  Kentucky  and  Georgia  varied  between  4%  and  30%.  Nevertheless,  differences  between 
inferred  and  directly  measured  loss  rates  remained  below  1.5%. 

In  summary,  our  MBone  experiments  showed  that  inferred  and  directly  measured  loss  rates  agreed 
closely  under  a  variety  of  real  network  conditions: 

•  Across  a  wide  range  of  loss  rates  (4%-30%)  on  the  same  link. 

•  Across  links  with  very  low  (<  1%)  and  very  high  (>  30%)  loss  rates. 

•  Across  all  links  in  a  multicast  tree  regardless  of  their  position  in  the  tree. 

•  Across  different  multicast  trees. 

•  Across  time  of  day  and  day  of  the  week. 

Furthermore,  in  all  cases  the  inference  algorithm  converged  to  the  desired  loss  rates  well  within  each 
two-minute,  1,200-probe  measurement  interval. 

A more detailed description of the experiments and results is given in [7], which is included in the
Appendix.

Figure 3: Inferred vs. actual loss rates on link between UKy and GA.

6.2  Simulation  Experiments 

We have performed more extensive validations of our inference techniques through simulation in two
different settings: simulation of the model with Bernoulli losses, and simulation of networks with realistic
traffic. In the model simulations, probe loss and delay obey the independence assumption of the model.
We applied the inference algorithm to the end-to-end measurements and compared the inferred and actual
model parameters for a large set of topologies and parameter values. We found that the loss rate, mean delay,
and variance estimates converged close to their actual values within 2,000 probes. Accurately estimating
the entire delay distribution requires more probes; in our experiments we found good
agreement with 10,000 probes.
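A model simulation of this kind can be sketched on the smallest nontrivial tree: a source connected by one shared link to a branch point, which feeds two receivers over separate links. The Python below is a minimal illustration under that assumed topology. The function names are ours, and the estimator simply matches observed receipt frequencies under the independence assumption; it is not meant to reproduce the full inference algorithm evaluated in this report.

```python
import random


def simulate_two_leaf(a_shared, a_left, a_right, n_probes, seed=0):
    """Simulate independent Bernoulli link losses; return (left, right) receipt flags."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_probes):
        through_shared = rng.random() < a_shared
        # A probe reaches a receiver only if it survives the shared link
        # and that receiver's own link.
        left = through_shared and rng.random() < a_left
        right = through_shared and rng.random() < a_right
        outcomes.append((left, right))
    return outcomes


def infer_pass_rates(outcomes):
    """Estimate (shared, left, right) link pass rates from end-to-end observations."""
    n = len(outcomes)
    g_left = sum(l for l, _ in outcomes) / n          # P(left receives)  = a_s * a_l
    g_right = sum(r for _, r in outcomes) / n         # P(right receives) = a_s * a_r
    g_both = sum(l and r for l, r in outcomes) / n    # P(both receive)   = a_s * a_l * a_r
    shared = g_left * g_right / g_both
    return shared, g_both / g_right, g_both / g_left
```

The shared-link estimate relies on the correlation between the two receivers: the product of the two marginal receipt frequencies counts the shared link twice, and dividing by the joint frequency removes one copy.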

The second type of experiment is based on the ns [20] simulator. Here delay and loss correspond to
queueing delay and queue overflow at network nodes as multicast probes compete with traffic generated by
TCP and UDP sources. The source generates multicast probes with a fixed mean interarrival time;
we used CBR or Poisson probes. We simulated different topologies with different background traffic mixes
comprising infinite FTP sessions over TCP and exponential or Pareto on-off UDP sources. We considered
both Drop Tail and Random Early Detection (RED) [17] buffer discard methods.

We compared the inferred loss and delay with actual probe loss and delay. We found rapid convergence
of the estimates, although with small persistent differences. We attribute this to the presence of spatial
dependence, i.e., dependence between probe losses and delays on different links. This can arise through
correlations in the background traffic caused by TCP dynamics, e.g., synchronization
between flows as a result of slow-start after packet loss. We have shown in [4] that small deviations from
the spatial independence assumption lead to only small errors in inference.

Figure 4: Inferred and sample delay ccdf for a leaf link in the topology of Figure 2.

We also found that background traffic introduces temporal dependence in probe behavior, e.g., its burstiness
can cause back-to-back probe losses. We have shown that while temporal dependence can decrease
the rate of convergence of the estimators, consistency is unaffected. In the experiments the inferred values
converged within 2,000 probes despite the presence of temporal dependence.

While there is understanding of mechanisms by which temporal and spatial dependence can occur, as
far as we know there are no experimental results concerning its magnitude. We believe that large or
long-lasting dependence is unlikely in the Internet because of traffic and link diversity. Moreover, we expect loss
correlation to be reduced by the introduction of RED.

We also compared the inferred probe loss rates with the background loss rates. The experiments showed
these to be quite close, although not as close as inferred and actual probe loss rates. We attribute this to the
inherent difference in the statistical properties of probe traffic and background traffic.

To illustrate the distribution of delay inference results, we simulated the topology of the multicast routing
tree shown in Figure 2. In order to capture the heterogeneity between edges and core of a network, interior
links have higher capacity (5 Mb/sec) and propagation delay (50 ms) than those at the edge (1 Mb/sec and
10 ms). Background traffic comprises infinite FTP sessions and exponential on-off UDP sources. Each link
is modeled as a FIFO queue with a 4-packet capacity. Real buffers are usually much larger; the capacity of
four is used to reduce the time required to simulate the network. The discard policy is Drop Tail. In Figure
4, we plot the inferred versus the sample complementary cumulative distribution function (discretized in one
millisecond bins) for one of the leaf links, using about 18,000 Poisson probes. The estimated distribution
closely follows the sample distribution and is quite accurate for tail probabilities greater than 10"^. Note
that the estimated distribution is not always monotonically decreasing. This is because negative probabilities
are occasionally estimated in the tail due to an insufficient number of samples. It is worth pointing out that,
given the irregular shape of the sample distribution, the same level of accuracy would not be possible using
a parametric model.
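The two curves being compared can be sketched in Python as follows. The first function builds a sample ccdf from measured delays in fixed-width bins. The second turns per-bin probability masses, such as an inference procedure might output, into a ccdf, and shows why a negative estimated mass makes the curve locally increase. All names are illustrative, and the bin masses in the usage note are made up for the example rather than taken from our experiments.

```python
def empirical_ccdf(delays_ms, bin_ms=1.0):
    """ccdf[k] = fraction of samples falling in a bin strictly above bin k."""
    n = len(delays_ms)
    counts = [0] * (int(max(delays_ms) // bin_ms) + 1)
    for d in delays_ms:
        counts[int(d // bin_ms)] += 1
    return ccdf_from_bin_masses([c / n for c in counts])


def ccdf_from_bin_masses(masses):
    """ccdf[k] = total probability mass in bins above k (masses may be estimates)."""
    ccdf, tail = [], sum(masses)
    for m in masses:
        tail -= m  # a negative estimated mass makes the tail grow here
        ccdf.append(tail)
    return ccdf


def is_nonincreasing(values):
    return all(b <= a for a, b in zip(values, values[1:]))
```

For example, `ccdf_from_bin_masses([0.6, 0.3, 0.12, -0.02])` is not nonincreasing, because the last bin carries a negative estimated mass, whereas a ccdf built by `empirical_ccdf` always is.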

The evaluation of loss-based techniques can be found in [4] and [5], both of which are contained in the
Appendix. The evaluation of delay-based techniques can be found in [15], also contained in the Appendix.

Figure 5: An RTCP-based tool deployment example, on the same topology as shown in Figure 2, with
inference being performed at UMass.

7  Measurement  and  Analysis  Tools 

We have developed two sets of tools with which to deploy the MINC methodology. First, we have developed
tools that build on the real-time transport protocol, RTP, and its associated control protocol, RTCP,
for generating and collecting end-to-end multicast measurement traces. Our RTP-based tool is described in
Section 7.1. Second, we developed a web-based analysis and visualization tool, MINT (Multicast Inference
Network Tool), for inferring loss behavior. This tool is described in Section 7.2.

7.1  Integration  with  RTCP 

We  have  developed  tools  for  applying  MINC  in  real-time,  so  that  MINC  can  be  used  by  applications  to 
respond  to  changing  network  conditions  in  new  and  more  sophisticated  ways.  For  example,  a  management 
program  might  adaptively  adjust  its  probes  to  home  in  on  a  problem  router. 

Our tools transmit network information using RTCP, the control protocol for the multicast transport protocol
RTP [22]. By sharing their traces using RTCP, they benefit from RTCP’s built-in scaling mechanisms.

The  approach  is  based  on  three  tools:  mgen,  mflect,  and  mmerge  (Figure  5).  mgen  generates 
a  stream  of  data  (and  may  be  replaced  by  any  other  application  that  multicasts  data  over  RTP).  A  copy 
of  mflect  at  each  receiver  maintains  traces  of  the  packets  it  does  and  does  not  receive  from  mgen.  It 
periodically  multicasts  these  (in  a  sense  reflecting  the  data  stream:  hence  “mflect”).  mmerge  collects  the 
traces  sent  by  mflect,  collating  those  from  the  different  data  receivers  and  making  them  available  to  a 
tool,  such  as  MINT,  for  inference. 
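The collation step performed by mmerge can be pictured with a short Python sketch. The report format assumed here (a set of received sequence numbers per receiver) and the function name `merge_traces` are hypothetical. They do not describe the RTCP wire format or the actual mmerge interface.

```python
def merge_traces(reports, n_probes):
    """Collate per-receiver reports into one row of receipt flags per probe.

    reports: dict mapping receiver name -> set of received sequence numbers
             (a hypothetical in-memory format).
    Returns the receiver ordering and, for each probe 0..n_probes-1, a list
    of booleans saying which receivers saw it.
    """
    receivers = sorted(reports)
    rows = [[seq in reports[r] for r in receivers] for seq in range(n_probes)]
    return receivers, rows
```

Each row is the joint outcome of one probe across all receivers, which is the form of input that a loss inference engine needs.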

mflect  and  mmerge  are  designed  so  that  they  may  be  incorporated  directly  into  existing  and  future 
multicast  applications.  Their  joint  functionality  is  available  as  an  extension  to  the  RTP  common  code  library 
from  University  College  London,  called  RTPXR,  (“extended  Reporting”).  An  application  using  RTPXR 
would  be  in  a  position  to  respond  adaptively  to  information  on  the  topology  of  its  data  distribution  tree. 

Figure 6: MINT view of the logical multicast tree with losses.

In order that these tools scale to large numbers of receivers, we have developed thinning and compression
techniques. Thinning of receiver reports is necessary to keep receivers from consuming excessive bandwidth.
We developed a coordinated thinning algorithm, which attempts to maximize the overlap between receiver
reports, i.e., to have receivers report on the same set of probes. Compression can further reduce bandwidth
usage. Details of the approach can be found in [6], which is contained in the Appendix.
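One simple way to obtain such overlap, shown here only as an illustration and not as the algorithm of [6], is for every receiver to apply the same deterministic rule to the probe sequence number. In the Python sketch below, the hash-based rule, the `session_key` parameter, and the function names are all assumptions of ours.

```python
import hashlib


def should_report(seq, keep_fraction, session_key=b"minc"):
    """Decide from the sequence number alone whether a probe is reported."""
    digest = hashlib.sha256(session_key + seq.to_bytes(8, "big")).digest()
    # Map the hash to [0, 2**64) and keep roughly keep_fraction of all probes.
    return int.from_bytes(digest[:8], "big") < keep_fraction * 2**64


def thin(received_seqs, keep_fraction):
    """Return the subset of received probes that this receiver reports on."""
    return {s for s in received_seqs if should_report(s, keep_fraction)}
```

Because the decision depends only on the sequence number, two receivers never select different probes: the probes they both report on are exactly the selected probes that both received.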

7.2  MINT:  Multicast  Inference  Network  Tool 

MINT  is  intended  to  facilitate  multicast-based  inference.  It  takes  as  inputs  all  of  the  traces  collected  from 
the  end-hosts.  These  traces  may  or  may  not  include  mtrace  outputs.  MINT  comprises  three  components: 
a web-based user interface, a topology discovery algorithm, and an inference engine. Users interact with
MINT to control the inference, for example by choosing the number of samples, visualizing the multicast tree with
losses, or showing the performance evolution over specific links. Depending on the availability of mtrace
output, MINT discovers the topology by either parsing mtrace output or inferring the multicast tree from
the loss traces. The inference engine takes topology information and loss traces, infers the network-internal
loss, and provides this to the user. The user can then view the results in one of several ways. One way is
to lay out the logical multicast tree and display the links in different colors to distinguish different average
loss rates (e.g., Figure 6). The user can also focus on a single link and observe how the loss rate evolves
over time for that link.


8  Conclusions 

We have presented the results obtained from our research on the use of end-end measurements to infer internal
network behavior. We also described several tools that resulted from this effort for obtaining measurements
and for analyzing and visualizing the results. All of the algorithms and their detailed evaluation
can be found in the Appendix. Last, it is worth noting that this project has had considerable impact
on research in this area. It constitutes the first effort to develop a rigorous foundation on top of which to design
and evaluate end-to-end based network management tools. Many efforts are now under way exploring
various extensions to the approaches developed under MINC. These include efforts at Rice, Boston Univ.,
UPenn, Lucent Bell Labs, and Technion, among others.


9  Bibliography 
References 

[1] A. Adams, T. Bu, R. Caceres, N.G. Duffield, T. Friedman, J. Horowitz, F. Lo Presti, S.B. Moon, V.
Paxson, D. Towsley. “The Use of End-to-End Multicast Measurements for Characterizing Internal
Network Behavior,” IEEE Communications Magazine, May 2000.

[2] M. Adler, T. Bu, R. K. Sitaraman, D. Towsley. “Tree Layout for Internal Network Characterizations in
Multicast Networks,” Proc. 3rd Intl Workshop on Networked Group Communication, Nov. 2001. Also
available as UMass CMPSCI Technical Report 00-44.

[3] T. Bu, N. Duffield, F. Lo Presti, D. Towsley. “Network Tomography on General Topologies,” To appear
in Proc. SIGMETRICS 2002.

[4] R. Caceres, N.G. Duffield, J. Horowitz, D. Towsley. “Multicast-Based Inference of Network-Internal
Loss Characteristics,” IEEE Transactions on Information Theory, Nov. 1999.

[5] R. Caceres, N.G. Duffield, J. Horowitz, D. Towsley, T. Bu. “Multicast-Based Inference of Network-Internal
Characteristics: Accuracy of Packet Loss Estimation,” Proc. IEEE Infocom ’99, March 1999.

[6] R. Caceres, N.G. Duffield, T. Friedman. “Impromptu Measurement Infrastructures using RTP,” To
appear in Proc. IEEE INFOCOM ’02, June 2002.

[7] R. Caceres, N.G. Duffield, S.B. Moon, D. Towsley. “Inference of Internal Loss Rates in the MBone,”
Proc. IEEE/ISOC Global Internet ’99, December 1999.

[8] N.G. Duffield, J. Horowitz, F. Lo Presti. “Adaptive multicast topology inference,” Proc. IEEE Infocom
2001, April 2001.

[9] N.G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley. “Multicast Topology Inference from End-to-end
Measurements,” Proc. IP Traffic Measurement, Modeling and Management, Sept. 2000.

[10] N.G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley. “Network Delay Tomography from End-to-end
Unicast Measurements,” Proc. of the 2001 International Workshop on Digital Communications 2001 -
Evolutionary Trends of the Internet, Taormina, Italy, Sept. 17-20, 2001.




[11] N.G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley. “Multicast Topology Inference from Measured
End-to-End Loss,” IEEE Trans. on Information Theory, 2002.

[12] N.G. Duffield, J. Horowitz, D. Towsley, W. Wei, T. Friedman. “Multicast-Based Loss Inference with
Missing Data,” To appear in IEEE Journal on Selected Areas in Communications, 2002.

[13] N.G. Duffield, F. Lo Presti. “Multicast Inference of Packet Delay Variance at Interior Networks Links,”
Proc. Infocom ’2000, Tel Aviv, Israel, March 26-30, 2000.

[14] N.G. Duffield, F. Lo Presti, V. Paxson, D. Towsley. “Inferring link loss using striped unicast probes,”
Proc. IEEE Infocom 2001, April 2001.

[15] F. Lo Presti, N.G. Duffield, J. Horowitz, D. Towsley. “Multicast-Based Inference of Network-Internal
Delay Distributions,” UMass CMPSCI Technical Report 99-55, 1999.

[16] G.J. McLachlan, T. Krishnan. The EM algorithm and extensions. John Wiley, New York, 1997.

[17] S. Floyd, V. Jacobson. “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM
Trans. on Networking, 1(4):397-413, Aug. 1993.

[18] mtrace: Print multicast path from a source to a receiver. ftp://ftp.parc.xerox.com/pub/net-research/ipmulti

[19] V. Paxson, J. Mahdavi, A. Adams, M. Mathis. “An Architecture for Large-Scale Internet Measurement,”
IEEE Communications, 36(8):48-54, August 1998.

[20] ns: Network simulator. http://www-mash.cs.berkeley.edu/ns

[21] S. Ratnasamy, S. McCanne. “Inference of Multicast Routing Tree Topologies and Bottleneck Bandwidths
Using End-to-End Measurements,” Proc. IEEE Infocom ’99, March 1999.

[22] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. “RTP: A Transport Protocol for Real-Time
Applications,” RFC 1889, January 1996.




Appendix  A  Papers  and  Reports 

The following papers are included in the Appendix. They can also be accessed through the MINC web site
at http://www-net.cs.umass.edu/minc

1.  R.  Caceres,  N.G.  Duffield,  J.  Horowitz,  D.  Towsley,  “Multicast-Based  Inference  of  Network-Internal  Loss 
Characteristics,”  IEEE  Transactions  on  Information  Theory,  Nov.  1999. 

2.  T.  Bu,  N.  Duffield,  F.  Lo  Presti,  D.  Towsley.  “Network  Tomography  on  General  Topologies,”  To  appear  in  Proc. 
SIGMETRICS  2002. 

3.  N.G.  Duffield,  J.  Horowitz,  D.  Towsley,  W.  Wei,  T.  Friedman.  “Multicast-Based  Loss  Inference  with  Missing 
Data”,  To  appear  in  IEEE  Journal  on  Selected  Areas  in  Communications,  2002. 

4. F. Lo Presti, N.G. Duffield, J. Horowitz, D. Towsley. “Multicast-Based Inference of Network-Internal Delay
Distributions,” UMass CMPSCI Technical Report 99-55, 1999.

5. N.G. Duffield, F. Lo Presti. “Multicast Inference of Packet Delay Variance at Interior Networks Links,” Proc.
Infocom ’2000, Tel Aviv, Israel, March 26-30, 2000.

6. N.G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley. “Multicast Topology Inference from Measured End-to-End
Loss,” IEEE Trans. on Information Theory, 2002.

7. N.G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley. “Multicast Topology Inference from End-to-end
Measurements,” Proc. IP Traffic Measurement, Modeling and Management, Sept. 2000.

8. N.G. Duffield, J. Horowitz, F. Lo Presti. “Adaptive multicast topology inference,” Proc. IEEE Infocom 2001,
April 2001.

9. N.G. Duffield, F. Lo Presti, V. Paxson, D. Towsley. “Inferring link loss using striped unicast probes,” Proc.
IEEE Infocom 2001, April 2001.

10. M. Adler, T. Bu, R. K. Sitaraman, D. Towsley. “Tree Layout for Internal Network Characterizations in Multicast
Networks,” Proc. 3rd Intl Workshop on Networked Group Communication, Nov. 2001. Also available as UMass
CMPSCI Technical Report 00-44.

11. R. Caceres, N.G. Duffield, S. B. Moon, D. Towsley. “Inference of Internal Loss Rates in the MBone,” Proc.
IEEE/ISOC Global Internet ’99, December 1999.

12.  R.  Caceres,  N.G.  Duffield,  J.  Horowitz,  D.  Towsley,  T.  Bu,  “Multicast-Based  Inference  of  Network-Internal 
Characteristics:  Accuracy  of  Packet  Loss  Estimation,”  Proc.  IEEE  Infocom  ’99,  March  1999. 




Multicast-Based  Inference  of  Network-Internal  Loss  Characteristics* 


R. Caceres, N.G. Duffield, J. Horowitz, D. Towsley


Abstract 

Robust measurements of network dynamics are increasingly important to the design and operation of
large internetworks like the Internet. However, administrative diversity makes it impractical to monitor
every link on an end-to-end path. At the same time, it is difficult to determine the performance
characteristics of individual links from end-to-end measurements of unicast traffic. In this paper, we introduce the
use of end-to-end measurements of multicast traffic to infer network-internal characteristics. The
bandwidth efficiency of multicast traffic makes it suitable for large-scale measurements of both end-to-end
and internal network dynamics.

We develop a Maximum Likelihood Estimator for loss rates on internal links based on losses
observed by multicast receivers. It exploits the inherent correlation between such observations to infer the
performance of paths between branch points in the tree spanning a multicast source and its receivers.
We derive its rate of convergence as the number of measurements increases, and we establish robustness
with respect to certain generalizations of the underlying model. We validate these techniques through
simulation and discuss possible extensions and applications of this work.


1  Introduction 


Background  and  Motivation.  Fundamental  ingredients  in  the  successful  design,  control  and  management 
of  networks  are  mechanisms  for  accurately  measuring  their  performance.  Two  approaches  to  evaluating 
network  performance  have  been: 

(i)  Collecting  statistics  at  internal  nodes  and  using  network  management  packages  to  generate  link-level 
performance  reports;  and 

(ii)  Characterizing  network  performance  based  on  end-to-end  behavior  of  point-to-point  traffic  such  as 
that  generated  by  TCP  or  UDP. 

A significant drawback of the first approach is that gaining access to a wide range of routers in an
administratively diverse network can be difficult. Introducing new measurement mechanisms into the routers
themselves is likewise difficult because it requires persuading large companies to alter their products. Also,
the composition of many such small measurements to form a picture of end-to-end performance is not
completely understood.

*This work was sponsored in part by the DARPA and the Air Force Research Laboratory under agreement F30602-98-2-0238.

Regarding the second approach, there has been much recent experimental work to understand the
phenomenology of end-to-end performance (e.g., see [1, 2, 14, 19, 21, 22]). A number of ongoing measurement
infrastructure projects (Felix [5], IPMA [7], NIMI [13] and Surveyor [28]) aim to collect and analyze
end-to-end measurements across a mesh of paths between a number of hosts. pathchar [10] is under evaluation
as a tool for inferring link-level statistics from end-to-end point-to-point measurements. However, much
work remains to be done in this area.

Contribution.  In  this  paper,  we  consider  the  problem  of  characterizing  link-level  loss  behavior  within  a 
network  through  end-to-end  measurements.  We  present  a  new  approach  based  on  the  measurement  and 
analysis  of  the  loss  behavior  of  multicast  probe  traffic.  The  key  to  this  approach  is  that  multicast  traffic 
introduces  correlation  in  the  end-to-end  losses  measured  by  receivers.  This  correlation  can,  in  turn,  be  used 
to  infer  the  loss  behavior  of  the  links  within  the  multicast  routing  tree  spanning  the  sender  and  receivers. 
This  enables  the  identification  of  links  with  higher  loss  rates  as  candidates  for  the  origin  of  the  degradation 
of  end-to-end  performance. 

Using  this  approach,  we  develop  maximum  likelihood  estimators  (MLEs)  of  the  link  loss  rates  within 
a  multicast  tree  connecting  the  sender  of  the  probes  to  a  set  of  receivers.  These  estimates  are,  initially, 
derived  under  the  assumption  that  link  losses  are  described  by  independent  Bernoulli  losses,  in  which  case 
the  problem  is  that  of  estimating  the  link  loss  rates  given  the  end-to-end  losses  for  a  series  of  n  probes.  We 
show  that  these  estimates  are  strongly  consistent  (converge  almost  surely  to  the  true  loss  rates).  Moreover, 
the  asymptotic  normality  property  of  MLEs  allows  us  to  derive  an  expression  for  their  rate  of  convergence 
to  the  true  rates  as  n  increases. 

We  evaluate  our  approach  for  two-,  four-,  and  eight-receiver  populations  through  simulation  in  two 
settings.  In  the  first  type  of  experiment,  link  losses  are  described  by  time-invariant  Bernoulli  processes. 
Here  we  find  rapid  convergence  of  the  estimates  to  their  actual  values  as  the  number  of  probes  increases. 
The  second  type  of  experiment  is  based  on  ns  [18]  simulations  where  losses  are  due  to  queue  overflows  as 
probe  traffic  competes  with  other  traffic  generated  by  infinite  data  sources  that  use  the  Transmission  Control 
Protocol (TCP) [24]. In the two- and four-receiver topologies with few background connections we find fast
convergence  although  there  are  persistent,  if  small,  differences  between  the  inferred  and  actual  loss  rates. 

The  cause  of  these  differences  is  that  losses  in  our  simulated  network  display  spatial  dependence  (i.e., 
dependence  between  links),  which  violates  the  Bernoulli  assumption.  We  believe  that  large  and  long-lasting 
spatial  dependence  is  unlikely  in  a  real  network  such  as  the  Internet  because  of  its  traffic  and  link  diversity. 
This  is  supported  by  experiments  with  an  eight-receiver  topology  with  diverse  background  traffic  in  which 
we found closer agreement between inferred and actual loss rates. Furthermore, we believe that the
introduction of Random Early Detection (RED) [6] policies in Internet routers will help break such dependence.

The potential for both spatial and temporal dependence of loss motivates investigation into their effect.
Our analysis shows that dependence introduces inference errors in a continuous manner: if the dependence
is small, the errors in the estimates are also small. Furthermore, the errors are a second order effect: in the
special case of a binary tree with statistically identical dependent loss on sibling links, the Bernoulli MLEs
of the marginal loss rates are actually unaffected for interior links of the tree. More generally, the MLE will
be insensitive to spatial dependence of loss within regions of similar loss characteristics. Finally, the
analysis shows how prior knowledge of the likely magnitude of dependence (e.g., from independent network
measurements) could be used to correct the Bernoulli MLE.

We  note  that  interference  from  TCP  sources  introduces  temporal  dependence  (i.e.,  dependence  between 
different  packets)  that  also  violates  the  Bernoulli  assumption.  This  dependence  is  apparent  in  our  simulated 
network,  where  probe  losses  often  occur  back-to-back  due  to  burstiness  in  the  competing  TCP  streams. 
Such  dependence  has  also  been  measured  in  the  Internet,  but  rarely  involves  more  than  a  few  consecutive 
packets [1]. The consistency of the estimator does not require independence between probes; it is sufficient
that the loss process be ergodic. This property holds, e.g., when the dependence between losses has
sufficiently short range. However, the rate of convergence of the estimates to their true values will be slower.
We quantify this for Markovian losses by applying the Central Limit Theorem for the occupation times of
Markov processes. We use this approach to compare the efficacy of two sampling strategies in the presence
of Markovian losses. In our experiments, inferred loss rates closely tracked actual loss rates despite the
presence of temporal dependence.
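The effect of temporal dependence on a loss-rate estimate can be illustrated with a two-state Markov loss process. The following Python sketch is only an illustration under assumed transition probabilities and names of our choosing. It generates bursty losses and compares the observed loss frequency with the stationary loss rate, p / (p + q), of the chain.

```python
import random


def markov_loss_trace(p_good_to_bad, p_bad_to_good, n_probes, seed=0):
    """Return per-probe loss flags (True = lost) from a two-state Markov chain."""
    rng = random.Random(seed)
    lost = False
    trace = []
    for _ in range(n_probes):
        if lost:
            # Remain in the loss state unless the chain recovers.
            lost = rng.random() >= p_bad_to_good
        else:
            lost = rng.random() < p_good_to_bad
        trace.append(lost)
    return trace


def stationary_loss_rate(p_good_to_bad, p_bad_to_good):
    """Long-run fraction of probes lost for the chain above."""
    return p_good_to_bad / (p_good_to_bad + p_bad_to_good)
```

The observed frequency `sum(trace) / len(trace)` still approaches the stationary rate as the number of probes grows, but consecutive losses are positively correlated, so more probes are needed for a given accuracy than with independent losses.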

The  work  presented  in  this  paper  assumes  that  the  topology  of  the  multicast  tree  is  known  in  advance. 
We  are  presently  developing  algorithms  to  infer  the  multicast  tree  from  the  probe  measurements  themselves. 

We envisage deploying inference engines as part of a measurement infrastructure comprising hosts
exchanging probes in a WAN. Each host will act as the source of probes down a multicast tree to the others.
A strong advantage of using multicast rather than unicast traffic is efficiency. N multicast servers produce
a network load that grows at worst linearly as a function of N. On the other hand, the exchange of unicast
probes can lead to local loads which grow as N^2, depending on the topology. We illustrate this in Figure 1.
In this example, 2N servers exchange probes. For unicast probes, the load on the central link grows as N^2; for
multicast probes it grows only as 2N.
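The scaling argument can be checked by direct counting. The Python sketch below assumes the layout suggested by the figure, with hosts 1..N on one side of the central link and hosts N+1..2N on the other, and assumes each host probes every other host once. The function name is ours. Counting both directions, the unicast total is 2N^2, i.e., N^2 per direction.

```python
def central_link_load(n):
    """Count probes crossing the central link when 2n hosts each probe all others once."""
    left = range(1, n + 1)            # hosts 1..N on one side (assumed layout)
    right = range(n + 1, 2 * n + 1)   # hosts N+1..2N on the other side
    # Unicast: one probe per ordered (sender, receiver) pair on opposite sides.
    unicast = sum(1 for s in left for r in right) + sum(1 for s in right for r in left)
    # Multicast: each source sends a single copy across the link; it is
    # replicated only after the branch point on the far side.
    multicast = len(left) + len(right)
    return unicast, multicast
```

For instance, with N = 4 the unicast exchange places 32 probes on the central link per round, against 8 for multicast.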

Figure 1: Probe Method, Load and Topology: 2N servers exchange probes. For unicast probes, the
load on the central link grows as N^2; for multicast probes it grows only as 2N.

Related Work. There are a number of measurement infrastructure projects in progress, all based on the
exchange of unicast probes between hosts in the current Internet. Two of these, IPMA (Internet Performance
Measurement and Analysis) [7] and Surveyor [28], focus on measuring loss and delay statistics; in the
former between public Internet exchange points, in the latter between hosts deployed at sites participating in
Internet 2. A third, Felix [5], is developing linear decomposition techniques to discover network topology,
with an emphasis on network survivability. A fourth, NIMI (National Internet Measurement Infrastructure)
[13], concentrates on building a general-purpose platform on which a variety of measurements can be
carried out. These infrastructure efforts emphasize the growing importance of network measurements and
help motivate our work. We believe our multicast-based techniques would be a valuable addition to these
measurement platforms.

There is a multicast-based measurement tool, mtrace [16], already in use in the Internet. mtrace reports
the  route  from  a  multicast  source  to  a  receiver,  along  with  other  information  about  that  path  such  as  per-hop 
loss  and  delay  statistics.  Topology  discovery  through  mtrace  is  performed  as  part  of  the  tracer  tool  [12]. 

However, mtrace suffers from performance and applicability problems in the context of large-scale
measurements. First, mtrace traces the path from the source to a single receiver by working back through the
multicast tree starting at that receiver. In order to cover the complete multicast tree, mtrace would need to
run once for each receiver, which does not scale well to large numbers of receivers. In contrast, the inference
techniques described in this paper cover the complete tree in a single pass. Second, mtrace relies on
multicast routers to respond to explicit measurement queries. Current routers support these queries. However,
Internet  service  providers  may  choose  to  disable  this  feature  since  it  gives  anyone  access  to  detailed  delay 
and  loss  information  about  paths  in  their  part  of  the  network.  In  contrast,  our  inference  techniques  do  not 
rely  on  cooperation  from  any  network-internal  elements. 

We  now  turn  our  attention  to  related  theoretical  work  on  inference  methodologies.  There  has  been 
some  ad  hoc,  statistically  non-rigorous  work  on  deriving  link-level  loss  behavior  from  end-to-end  multicast 
measurements.  An  estimator  proposed  in  [33]  attributes  the  absence  of  a  packet  at  a  set  of  receivers  to  loss 
on  the  common  path  from  the  source.  However,  this  is  biased,  even  as  the  number  of  probes  n  goes  to 
infinity. 

For a different problem, some analytic methods for inference of traffic matrices have been proposed
quite recently [30, 31]. The focus of these studies was to determine the intensities of individual
source-destination flows from measurements of aggregate flows taken at a number of points in a network. Although
there are formal similarities in the inference problems with those of the present paper, the problem addressed
by the other papers was substantially different. The solutions are not always unique or easily identifiable,
sometimes needing supplementary methods to identify a candidate solution. This was a consequence of a
combination of the coarseness of the data (average data rates in the class of Poissonian traffic processes) and
the generality of the network topology considered.

Structure  of  the  Paper.  The  remainder  of  the  paper  is  structured  as  follows.  In  Section  2  we  present  a  loss 
model  for  multicast  trees  and  describe  the  framework  within  which  analysis  will  occur.  Section  3  contains 
the  derivation  of  the  estimators  themselves;  the  specific  example  of  the  two-leaf  tree  is  worked  out  explicitly. 
Section  4  analyzes  the  rates  of  convergence  of  estimators  as  the  number  of  probes  is  increased.  In  particular, 
we  obtain  a  simple  approximation  for  estimator  variance  in  the  regime  of  small  loss  probabilities.  In  Section 
5  we  present  an  algorithm  for  computing  packet  loss  estimates,  and  tests  for  consistency  of  the  data  with  the 
model.  Section  6  presents  the  results  of  simulation  experiments  that  validate  our  approach.  Motivated  in  part 
by  the  experimental  results,  we  continue  by  examining  the  effects  of  violation  of  the  Bernoulli  assumption. 
In  Section  7  we  analyze  the  effects  of  spatial  dependence  on  our  estimators.  We  show  how  to  correct  for 
them  on  the  basis  of  some  a  priori  knowledge  of  their  magnitude;  we  show  that  in  any  case  they  deform 
the  estimates  based  on  the  Bernoulli  assumption  only  to  second  order.  In  Section  8  we  analyze  the  effect 
of  temporal  dependence  on  the  loss  process.  We  show  that  the  asymptotic  accuracy  of  the  Bernoulli-based 
estimator  is  unaffected,  although  it  may  converge  more  slowly.  We  conclude  in  Section  9  with  a  summary 
of  our  contributions  and  proposals  for  further  work.  Some  of  the  proofs  are  deferred  to  Section  10. 

2  Model  &  Framework 

2.1  Description  of  Logical  Multicast  Trees 

Let $T = (V, L)$ denote the logical multicast tree from a given source, consisting of the set of nodes $V$,
including the source and receivers, and the set of links $L$. A link is an ordered pair $(j, k) \in V \times V$ denoting a
link from node $j$ to node $k$. The set of children of a node $j$ is denoted by $d(j)$ (i.e. $d(j) = \{k \in V : (j,k) \in L\}$).
For each node $j \in V$ apart from the root 0, there is a unique node $k = f(j)$, the parent of $j$, such that
$(k, j) \in L$. We shall define $f^n(k)$ recursively by $f^n(k) = f(f^{n-1}(k))$. We say that $j$ is a descendant of $k$
if $k = f^n(j)$ for some integer $n > 0$.

The root $0 \in V$ will represent the source of the probes. The set of leaf nodes $R \subset V$ (those with no
children) will represent the set of receivers. The logical multicast tree has the property that every node has
at least two children, apart from the root node (which has one) and the leaf-nodes (which have none).
On the other hand, nodes in the full (as opposed to logical) multicast tree can have only one child.
The logical multicast tree is obtained from the full multicast tree by deleting all nodes which have a single
child (apart from the root 0) and adjusting the links accordingly. More precisely, if $i = f(j) = f^2(k)$ are
nodes in the full tree and $\#d(j) = 1$, then we assign to the logical tree only the nodes $i$, $k$ and the link $(i,k)$.
Applying this rule to all such $i$, $j$ and $k$ in the full multicast tree yields the logical multicast tree.

Figure 2: (a) A multicast tree with two receivers, (b) The corresponding logical multicast tree.

A two receiver example is illustrated in Figure 2. A source multicasts a sequence of probes to two
receivers, $R_1$ and $R_2$. The probes traverse the multicast tree illustrated in Figure 2(a). Figure 2(b) illustrates
the  logical  multicast  tree,  where  each  path  between  branch  points  in  the  tree  depicted  in  Figure  2(a)  has  been 
replaced  by  a  single  logical  link. 

2.2  Modeling  the  Loss  of  Probe  Packets 

We  model  the  loss  of  probe  packets  on  the  logical  multicast  tree  by  a  set  of  mutually  independent  Bernoulli 
processes, each operating on a different link. Losses are therefore independent for different links and
different packets. In the introduction we discussed the reasons why network traffic can be expected to violate
these  assumptions;  in  Sections  7  and  8  we  discuss  the  extent  to  which  they  affect  the  estimators  described 
below,  and  how  these  effects  can  be  corrected  for. 

We now describe the loss model in more detail. With each node $k \in V$ we associate a probability
$\alpha_k \in [0,1]$ that a given probe packet is not lost on the link terminating at $k$. We model the passage of
probes down the tree by a stochastic process $X = (X_k)_{k \in V}$ where each $X_k$ takes a value in $\{0,1\}$; $X_k = 1$
signifies that a probe packet reaches node $k$, and 0 that it does not. The packets are generated at the source,
so $X_0 = 1$. For all other $k \in V$, the value of $X_k$ is determined as follows. If $X_k = 0$ then $X_j = 0$ for
the children $j$ of $k$ (and hence for all descendants of $k$). If $X_k = 1$, then for $j$ a child of $k$, $X_j = 1$ with
independent probability $\alpha_j$, and $X_j = 0$ with probability $\bar\alpha_j = 1 - \alpha_j$. (We shall write $1 - a$ as $\bar a$ in
general.) Although there is no link terminating at 0, we shall adopt the convention that $\alpha_0 = 1$, in order to
avoid excluding the root link from expressions concerning the $\alpha_k$. We display in Figures 3 and 4 examples
of two- and four-leaf logical multicast trees which we shall use for analysis and experiments.

Figure 3: A two-leaf logical multicast tree.
Figure 4: A four-leaf logical multicast tree.
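
The loss model of this section can be simulated directly. The Python sketch below is illustrative only: the tree encoding (a `children` dictionary keyed by node), the function name `simulate_probe`, and the pass probabilities in the example are assumptions made for the illustration and do not come from the paper.

```python
import random

def simulate_probe(children, alpha, rng, root=0):
    """Simulate one probe; return {node: 0 or 1} with X[root] = 1."""
    x = {root: 1}
    stack = [root]
    while stack:
        k = stack.pop()
        for j in children.get(k, []):
            # A child sees the probe only if its parent did and the
            # link terminating at j passes it (probability alpha[j]).
            x[j] = 1 if (x[k] == 1 and rng.random() < alpha[j]) else 0
            stack.append(j)
    return x

# Two-leaf logical tree of Figure 3: 0 -> 1 -> {2, 3}.
children = {0: [1], 1: [2, 3]}
alpha = {1: 0.9, 2: 0.8, 3: 0.7}   # hypothetical pass probabilities
rng = random.Random(0)             # fixed seed for repeatability
outcomes = [simulate_probe(children, alpha, rng) for _ in range(10000)]

# Fraction of probes seen at receiver 2; should be close to alpha_1 * alpha_2.
frac_at_2 = sum(x[2] for x in outcomes) / len(outcomes)
```

Only the values at the leaves (nodes 2 and 3 here) would be observable in an experiment; the interior values are kept in the sketch to make the propagation rule visible.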


2.3  Data,  Likelihood,  and  Inference 

In an experiment, a set of probes is dispatched from the source. We can think of each probe as a trial, the
outcome of which is a record of whether or not the probe was received at each receiver. Expressed in terms
of the random process $X$, each such outcome is the set of values of $X_k$ for $k$ in the set of leaf nodes $R$, i.e.
the random quantity $X_{(R)} = (X_k)_{k \in R}$, an element of the space $\Omega = \{0,1\}^R$ of all such outcomes. For a
given set of link probabilities $\alpha = (\alpha_k)_{k \in V}$, the distribution of the outcomes $(X_k)_{k \in R}$ will be denoted by
$P_\alpha$. The probability mass function for a single outcome $x \in \Omega$ is $p(x; \alpha) = P_\alpha[X_{(R)} = x]$.

Let us dispatch $n$ probes, and, for each possible outcome $x \in \Omega$, let $n(x)$ denote the number of probes
for which the outcome $x$ obtained. The probability of $n$ independent observations $x^1, \ldots, x^n$ (with each
$x^m = (x^m_k)_{k \in R}$) is then

$$p(x^1, \ldots, x^n; \alpha) = \prod_{m=1}^{n} p(x^m; \alpha) = \prod_{x \in \Omega} p(x; \alpha)^{n(x)}. \qquad (1)$$

Our task is to estimate the value of $\alpha$ from a set of experimental data $(n(x))_{x \in \Omega}$. We focus on the
class of maximum likelihood estimators (MLEs): i.e. we estimate $\alpha$ by the value $\tilde\alpha$ which maximizes
$p(x^1, \ldots, x^n; \alpha)$ for the data $x^1, \ldots, x^n$. Under very mild conditions, which are satisfied in the present
situation, MLEs exhibit many desirable properties, including strong consistency, asymptotic normality,
asymptotic unbiasedness, and asymptotic efficiency (see [11]). Strong consistency means that MLEs converge
almost surely (i.e., with probability 1) to their target parameters as the sample size increases. The last three
properties mean that, if the sample size is large, we can compute confidence intervals for the parameters
at a given confidence level, the estimators are approximately unbiased, and there is no other estimator that
would give the same level of precision with a smaller sample size.

Because of these properties, when a parametric model is available, MLEs are usually the estimators of
choice. Moreover, the confidence intervals allow us to estimate the accuracy of the estimates of $\alpha$, and in
particular their rate of convergence to the true parameter $\alpha$ as the number of samples $n$ becomes large. This
is important for understanding the number of probes which must be sent in order to obtain an estimate of $\alpha$
with some desired accuracy. Furthermore, in view of the possibility of large time-scale fluctuation in WANs,
e.g. Internet routing instabilities as reported by Paxson [19], the period over which probes are sent should
not be unnecessarily long.

3  The  Analysis  of  the  Maximum  Likelihood  Estimator 

In  this  section  we  establish  the  form  of  the  MLE  and  determine  the  rate  at  which  it  converges  to  the  true 
value as the number of probes increases; this can be used to make predictions for given models, and also
to  estimate  the  likely  accuracy  of  estimates  derived  from  actual  data.  We  work  this  out  completely  for  the 
two-leaf  tree  of  Figure  3. 

3.1  The  Likelihood  Equation  and  its  Solution 

It  is  convenient  to  work  with  the  log-likelihood  function 

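$$\mathcal{L}(\alpha) = \log p(x^1, \ldots, x^n; \alpha) = \sum_{x \in \Omega} n(x) \log p(x; \alpha). \qquad (2)$$

In the notation we suppress the dependence of $\mathcal{L}$ on $n$ and $x^1, \ldots, x^n$. Since log is increasing, maximizing
$p(x^1, \ldots, x^n; \alpha)$ is equivalent to maximizing $\mathcal{L}(\alpha)$.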

We introduce the notation that $k \preceq k'$ for $k, k' \in V$ whenever $k$ is a descendant of $k'$ or $k = k'$, and
$k \prec k'$ whenever $k \preceq k'$ but $k \ne k'$. We shall say that a link $k$ is at level $\ell = \ell(k)$ if there is a chain of $\ell$
ancestors $k = f^0(k) \prec f(k) \prec f^2(k) \cdots \prec f^\ell(k) = 0$ leading back to the root 0 of $T$. Levels 0 and 1 have
only one node. We will occasionally use $U$ to denote $V \setminus \{0\}$. Let $T(k) = (V(k), L(k))$ denote the subtree
within $T$ rooted at node $k$. $R(k) = R \cap V(k)$ will be the set of receivers which are descended from $k$. Let
$\Omega(k)$ be the set of outcomes $x$ in which at least one receiver in $R(k)$ receives a packet, i.e.,

$$\Omega(k) = \{x \in \Omega : \bigvee_{j \in R(k)} x_j = 1\}. \qquad (3)$$

Set $\gamma_k = \gamma_k(\alpha) = P_\alpha[\Omega(k)]$. An estimate of $\gamma_k$ is

$$\hat\gamma_k = \sum_{x \in \Omega(k)} \hat p(x), \qquad (4)$$

where $\hat p(x) = n(x)/n$ is the observed proportion of trials with outcome $x$. We will show that $\alpha$ can be calculated from
$\gamma = (\gamma_k)_{k \in V}$, and that the MLE

$$\tilde\alpha = \arg\max_{\alpha \in [0,1]^{\#U}} \mathcal{L}(\alpha) \qquad (5)$$
can be calculated in the same manner from the estimates $\hat\gamma$. The relation between $\alpha$ and $\gamma$ is as follows.
Define $\beta_k = P[\Omega(k) \mid X_{f(k)} = 1]$. The $\beta_k$ obey the recursion

$$\bar\beta_k = \bar\alpha_k + \alpha_k \prod_{j \in d(k)} \bar\beta_j, \quad k \in V \setminus R, \qquad (6)$$

$$\beta_k = \alpha_k, \quad k \in R. \qquad (7)$$

Then

$$\gamma_k = \beta_k \prod_{i=1}^{\ell(k)} \alpha_{f^i(k)}. \qquad (8)$$
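
The recursion (6)-(7) and the product relation (8) can be evaluated mechanically. The following Python sketch does so under assumptions of our own: the tree is given as a `children` dictionary, `alpha` includes the convention $\alpha_0 = 1$ for the root, and the function name `gamma_from_alpha` and the example values are hypothetical.

```python
def gamma_from_alpha(children, alpha, root=0):
    """Return (beta, gamma) computed from equations (6)-(8)."""
    beta, gamma = {}, {}

    def compute_beta(k):
        kids = children.get(k, [])
        if not kids:
            beta[k] = alpha[k]                    # (7): leaf node
        else:
            prod = 1.0
            for j in kids:
                prod *= 1.0 - compute_beta(j)     # product of (1 - beta_j)
            # (6) rearranged: beta_k = alpha_k * (1 - prod_j (1 - beta_j))
            beta[k] = alpha[k] * (1.0 - prod)
        return beta[k]

    def compute_gamma(k, path_above):
        # path_above is the product of alpha over the proper ancestors of k.
        gamma[k] = beta[k] * path_above           # (8)
        for j in children.get(k, []):
            compute_gamma(j, path_above * alpha[k])

    compute_beta(root)
    compute_gamma(root, 1.0)
    return beta, gamma

# Two-leaf tree of Figure 3 with hypothetical pass probabilities.
children = {0: [1], 1: [2, 3]}
alpha = {0: 1.0, 1: 0.9, 2: 0.8, 3: 0.7}
beta, gamma = gamma_from_alpha(children, alpha)
```

For this example $\gamma_1 = \alpha_1(1 - \bar\alpha_2\bar\alpha_3) = 0.846$, $\gamma_2 = \alpha_1\alpha_2 = 0.72$ and $\gamma_3 = \alpha_1\alpha_3 = 0.63$.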


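Theorem 1 Let $\mathcal{A} = \{(\alpha_k)_{k \in U} : \alpha_k > 0\}$, and $\mathcal{G} = \{(\gamma_k)_{k \in U} : \gamma_k > 0 \ \forall k;\ \gamma_k < \sum_{j \in d(k)} \gamma_j \ \forall k \in U \setminus R\}$.
There is a bijection $\Gamma$ from $\mathcal{A}$ to $\mathcal{G}$. Moreover, $\Gamma$ and $\Gamma^{-1}$ are continuously differentiable.


The proof of Theorem 1 relies on the following Lemma, whose proof is given in Section 10.


Lemma 1 Let $C$ be the set of $c = (c_i)_{i=1,2,\ldots,i_{\max}}$ with $c_i \in (0,1)$ and $\sum_i c_i > 1$. The equation $(1 - x) =
\prod_i (1 - c_i x)$ has a unique solution $x(c) \in (0,1)$. Moreover, $x(c)$ is continuously differentiable on $C$.


Proof of Theorem 1: The $\gamma_k$ have been expressed as a function of the $\alpha_k$, and clearly $\alpha_k > 0\ \forall k \in U$
implies the conditions for $\mathcal{G}$. Thus it remains to show that the mapping from $\mathcal{A}$ to $\mathcal{G}$ is injective. Let
$A_k = \prod_{i=0}^{\ell(k)} \alpha_{f^i(k)}$. From (8) we have

$$\gamma_k = A_k, \quad k \in R, \qquad (9)$$

while combining (6) and (8) we find

$$H_k(A_k, \gamma) := (1 - \gamma_k/A_k) - \prod_{j \in d(k)} (1 - \gamma_j/A_k) = 0, \quad k \in U \setminus R. \qquad (10)$$

Since $H_k(A_k, \gamma) = h(\gamma_k/A_k, \{\gamma_j/\gamma_k : j \in d(k)\})$, where $h(x, c) = (1 - x) - \prod_i (1 - c_i x)$ is the function
whose root is characterized in Lemma 1, there is a unique $A_k > \gamma_k$ which solves (10). We recover the $\alpha_k$
uniquely from the $A_k$ by taking the appropriate quotients (and setting $A_0 = \alpha_0 = 1$):

$$\alpha_k = A_k / A_{f(k)}, \quad k \in U. \qquad (11)$$

Clearly $\Gamma$ is continuously differentiable; that $\Gamma^{-1}$ is also follows from the corresponding statement for $x(c)$
in Lemma 1. ■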


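Candidates for the MLE are solutions of the likelihood equation for the stationary points $\alpha$ of $\mathcal{L}$:

$$\frac{\partial \mathcal{L}}{\partial \alpha_k}(\alpha) = 0, \quad k \in U. \qquad (12)$$


Theorem 2 When $\hat\gamma \in \mathcal{G}$, the likelihood equation has the unique solution $\hat\alpha := \Gamma^{-1}(\hat\gamma)$.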
Note that in the notation we have suppressed the dependence of $\hat\alpha$ and $\tilde\alpha$ on $n$ and $x^1, \ldots, x^n$. We defer
the proof of Theorem 2 to Section 10. That done, we must complete the argument by showing that the
stationary point does have maximum likelihood. For this we must impose additional conditions: $\hat\alpha$ is not
precluded from being either a minimum or a saddle for the likelihood function, the maximum falling on the
boundary of $[0,1]^{\#U}$. For some simple topologies we are able to establish directly that $\mathcal{L}(\alpha)$ is (jointly)
concave in the parameters at $\alpha = \hat\alpha$, which is hence the MLE $\tilde\alpha$. For more general topologies we use an
argument which establishes that $\hat\alpha = \tilde\alpha$ for all sufficiently large $n$, and whose proof also establishes some
useful asymptotic properties of $\hat\alpha$.

If $\alpha_k = 0$ for some link $k$, then $X_j = 0$ for all $j \in R(k)$, regardless of the values of $\alpha_j$ for $j$ descended
from $k$, and hence these cannot be determined. For this reason we now restrict attention to the case that all
$\alpha_k > 0$, by passing to a subtree if necessary; see Section 5.

Theorem 3 Assume $\alpha_k \in (0,1]$, $k \in U$.

(i) The model is identifiable, i.e., $\alpha, \alpha' \in (0,1]^{\#U}$ and $P_\alpha = P_{\alpha'}$ implies $\alpha = \alpha'$.

(ii) As $n \to \infty$, $\tilde\alpha \to \alpha$ and $\hat\alpha \to \alpha$, $P_\alpha$ almost surely.

(iii) Assume also $\alpha_k < 1$, $k \in U$. With probability 1, for sufficiently large $n$, $\hat\alpha = \tilde\alpha$.


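Maximum Likelihood Estimator for the Two-leaf Tree. Denote the 4 points of $\Omega = \{0,1\}^2$ by $\{00, 01, 10, 11\}$.
Then

$$\hat\gamma_1 = \hat p(11) + \hat p(10) + \hat p(01), \quad \hat\gamma_2 = \hat p(11) + \hat p(10), \quad \hat\gamma_3 = \hat p(11) + \hat p(01). \qquad (13)$$

The equations (10) for $\hat A_k$ in terms of the $\hat\gamma_k$ can be solved explicitly; combining with (11) we obtain the
estimates

$$\hat\alpha_1 = \frac{\hat\gamma_2 \hat\gamma_3}{\hat\gamma_2 + \hat\gamma_3 - \hat\gamma_1} = \frac{(\hat p(01) + \hat p(11))(\hat p(10) + \hat p(11))}{\hat p(11)} \qquad (14)$$

$$\hat\alpha_2 = \frac{\hat\gamma_2 + \hat\gamma_3 - \hat\gamma_1}{\hat\gamma_3} = \frac{\hat p(11)}{\hat p(01) + \hat p(11)} \qquad (15)$$

$$\hat\alpha_3 = \frac{\hat\gamma_2 + \hat\gamma_3 - \hat\gamma_1}{\hat\gamma_2} = \frac{\hat p(11)}{\hat p(10) + \hat p(11)} \qquad (16)$$

Note that although it is possible that $\hat\alpha_1 > 1$ for some finite $n$, this will not happen when $n$ is sufficiently
large, due to Theorem 3(ii).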

There is a simple interpretation of the estimates in (15) and (16). With the $\hat p$'s replaced by their
corresponding true probabilities $p$, (15) would give the probability of receiving a probe at node 2, given that it
is known to be received at node 3. For independent losses, this is just the probability $\alpha_2$ that the probe is
not lost on the link terminating at node 2. We have found, however, that the corresponding formulas when there
are more than 2 sibling nodes do not allow such a direct interpretation.
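
As an illustration of (14)-(16), the following Python sketch computes the two-leaf estimates from outcome counts. The function name `two_leaf_estimates`, the argument order, and the example counts are our own assumptions; following (13), the first digit of an outcome refers to node 2 and the second to node 3.

```python
def two_leaf_estimates(n11, n10, n01, n00):
    """Estimates (alpha_1, alpha_2, alpha_3) from counts of outcomes 11, 10, 01, 00."""
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        raise ValueError("no probe reached both receivers; estimates undefined")
    p11, p10, p01 = n11 / n, n10 / n, n01 / n
    a1 = (p01 + p11) * (p10 + p11) / p11      # (14)
    a2 = p11 / (p01 + p11)                    # (15)
    a3 = p11 / (p10 + p11)                    # (16)
    return a1, a2, a3

# Hypothetical counts for n = 1000 probes.
a1, a2, a3 = two_leaf_estimates(n11=504, n10=216, n01=126, n00=154)
```

These counts were chosen to equal the expected counts for $\alpha = (0.9, 0.8, 0.7)$, so the estimates come out as (0.9, 0.8, 0.7) up to rounding error. With real data, `a1` may exceed 1 for small `n`, as noted above.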
4 Rates of Convergence of Loss Estimator

4.1  Large  Sample  Behavior  of  the  Loss  Estimator 


In this section we examine in more detail the rate of convergence of $\hat\alpha$ and the MLE $\tilde\alpha$ to the true value $\alpha$.
We can apply some general results on the asymptotic properties of MLEs in order to show that $\sqrt{n}(\tilde\alpha - \alpha)$
is asymptotically normally distributed as $n \to \infty$; some general properties of MLEs ensure that the same
holds for $\sqrt{n}(\hat\alpha - \alpha)$, and with the same asymptotic variance. We can use these results in two ways. First, for
models of loss processes with typical parameters, we can estimate the number of probes required to obtain
an estimate with a given accuracy. Secondly, we can estimate the likely accuracy of $\hat\alpha$ from actual probe
data and associate confidence intervals to the estimates.

The fundamental object controlling convergence rates of the MLE $\tilde\alpha$ is the Fisher Information Matrix at
$\alpha$. This is defined for each $\alpha \in (0,1)^{\#U}$ as the $\#U$-dimensional real matrix
$\mathcal{I}_{jk}(\alpha) := \mathrm{Cov}\left(\frac{\partial \mathcal{L}}{\partial \alpha_j}(\alpha), \frac{\partial \mathcal{L}}{\partial \alpha_k}(\alpha)\right)$.
It is straightforward to verify that $\mathcal{L}$ satisfies conditions (see Section 2.3.1 of [27]) under which $\mathcal{I}$ is equal
to the following more convenient expression which we will use in the sequel:

$$\mathcal{I}_{jk}(\alpha) = -E\,\frac{\partial^2 \mathcal{L}}{\partial \alpha_j \partial \alpha_k}(\alpha). \qquad (17)$$

On the other hand, a direct calculation of the asymptotic variance of $\hat\alpha$ follows from the Central Limit
Theorem. The random variables $\hat\gamma$ are asymptotically Gaussian as $n \to \infty$ with

$$\sqrt{n}(\hat\gamma - \gamma) \xrightarrow{D} \mathcal{N}(0, \sigma), \qquad (18)$$

where $\sigma_{jk} = \lim_{n \to \infty} n\,\mathrm{Cov}(\hat\gamma_j, \hat\gamma_k)$, for $j, k \in U$. Here $\xrightarrow{D}$ denotes convergence in distribution. Since
by Theorem 1, $\Gamma^{-1}$ is continuously differentiable on $\mathcal{G}$, then by the Delta method (see Chapter 7 of [27])
$\hat\alpha = \Gamma^{-1}(\hat\gamma)$ is also asymptotically Gaussian, so establishing the first part of the following theorem. We
note that the matrices $\nu$ and $\mathcal{I}^{-1}(\alpha)$ agree on the interior of the parameter space, but, as we shall see below,
$\mathcal{I}(\alpha)$ may be singular on the boundary. Let $D_{ij}(\alpha) = \frac{\partial \Gamma^{-1}_i}{\partial \gamma_j}(\Gamma(\alpha))$, and let $^T$ denote the transpose.


Theorem 4 (i) When $\alpha_k \in (0,1]$, $k \in U$, then as $n \to \infty$,

$$\sqrt{n}(\hat\alpha - \alpha) \xrightarrow{D} \mathcal{N}(0, \nu), \quad \text{where } \nu = D(\alpha) \cdot \sigma \cdot D^T(\alpha). \qquad (19)$$

(ii) When $\alpha_k \in (0,1)$, $k \in U$, then $\mathcal{I}(\alpha)$ is non-singular and $\mathcal{I}^{-1}(\alpha) = \nu$.

(iii) When $\alpha_k \in (0,1)$, $k \in U$, $\sqrt{n}(\tilde\alpha - \alpha)$ converges in distribution as $n \to \infty$ to a $\#U$-dimensional
Gaussian random variable with mean 0 and covariance matrix $\mathcal{I}^{-1}(\alpha)$.


Theorem 4 enables us to determine, for example, that asymptotically for large $n$, with probability $1 - \delta$,
the $\hat\alpha_k$ will lie between the points

$$\alpha_k \pm z_{\delta/2} \sqrt{\frac{\mathcal{I}^{-1}_{kk}(\alpha)}{n}} \qquad (20)$$

where $z_{\delta/2}$ denotes the number that cuts off an area $\delta/2$ in the right tail of the standard normal distribution.
This is used for a confidence interval of level $1 - \delta$. As we are interested in a 95% confidence interval for
single link measurements, we take $z_{\delta/2} \approx 2$.


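Confidence Intervals for Parameters. With slight modification, the same methodology can be used to
obtain confidence intervals for the parameters $\hat\alpha$ derived from measured data from $n$ probes. Following [4]
we use the observed Fisher Information $\hat{\mathcal{I}}(\hat\alpha)$, in which the expectation in (17) is replaced by the
corresponding quantity computed from the observed data and evaluated at $\hat\alpha$.
Now, the proof of Theorem 2 (see particularly (57)) shows that the $\partial \mathcal{L}/\partial \alpha_k$ depend on the $n(x)$ only through
the combinations $n\hat\gamma_k$. Hence the same is true for the $\partial^2 \mathcal{L}/\partial \alpha_j \partial \alpha_k$. Since $P_{\hat\alpha}[\Omega(k)] = \Gamma(\Gamma^{-1}(\hat\gamma))_k = \hat\gamma_k$,
we have $\hat{\mathcal{I}}(\hat\alpha) = \mathcal{I}(\hat\alpha)$.

We then use confidence intervals for $\hat\alpha_k$ of the form

$$\hat\alpha_k \pm z_{\delta/2} \sqrt{\frac{\mathcal{I}^{-1}_{kk}(\hat\alpha)}{n}} \qquad (22)$$

This allows us to find simultaneous confidence regions from the asymptotic distribution for $\hat\alpha$ for a given
tree. An issue for further study is to understand how the confidence intervals change as the tree grows.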


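Example: Confidence Intervals for the Two-leaf Tree. An elementary calculation shows that the inverse
of the Fisher information matrix governing the confidence intervals for models in (20) is

$$\mathcal{I}^{-1}(\alpha) = \begin{pmatrix}
\dfrac{\alpha_1(\bar\alpha_3 - \alpha_2(1 + \alpha_3(\alpha_1 - 2)))}{\alpha_2 \alpha_3} & -\dfrac{\bar\alpha_2 \bar\alpha_3}{\alpha_3} & -\dfrac{\bar\alpha_2 \bar\alpha_3}{\alpha_2} \\[2ex]
-\dfrac{\bar\alpha_2 \bar\alpha_3}{\alpha_3} & \dfrac{\alpha_2 \bar\alpha_2}{\alpha_1 \alpha_3} & \dfrac{\bar\alpha_2 \bar\alpha_3}{\alpha_1} \\[2ex]
-\dfrac{\bar\alpha_2 \bar\alpha_3}{\alpha_2} & \dfrac{\bar\alpha_2 \bar\alpha_3}{\alpha_1} & \dfrac{\alpha_3 \bar\alpha_3}{\alpha_1 \alpha_2}
\end{pmatrix} \qquad (23)$$

Here, the order of the coordinates is $\alpha_1, \alpha_2, \alpha_3$. The inverse of the observed Fisher information governing
the confidence intervals for data in (22) is obtained by inserting (14)-(16) into (23). We note that in this
case $\mathcal{I}$ is singular at the boundaries $\alpha_2 = 1$ and $\alpha_3 = 1$.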
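
To make the use of (22) and (23) concrete, the following Python sketch evaluates the diagonal entries of the two-leaf inverse Fisher information, as we read them from (23), and the resulting confidence half-widths. The function names and example values are assumptions made for the illustration; `z = 2.0` follows the approximation $z_{\delta/2} \approx 2$ used above for 95% intervals.

```python
import math

def two_leaf_variances(a1, a2, a3):
    """Diagonal of the inverse Fisher information (23) for the two-leaf tree."""
    b2, b3 = 1.0 - a2, 1.0 - a3
    v1 = a1 * (b3 - a2 * (1.0 + a3 * (a1 - 2.0))) / (a2 * a3)
    v2 = a2 * b2 / (a1 * a3)
    v3 = a3 * b3 / (a1 * a2)
    return v1, v2, v3

def confidence_halfwidths(a_hat, n, z=2.0):
    """Half-widths z * sqrt(v_kk / n) of the intervals in (22)."""
    return [z * math.sqrt(v / n) for v in two_leaf_variances(*a_hat)]

# Hypothetical estimates from n = 1000 probes.
variances = two_leaf_variances(0.9, 0.8, 0.7)
half = confidence_halfwidths((0.9, 0.8, 0.7), n=1000)
```

Because the half-width scales as $1/\sqrt{n}$, halving the interval requires roughly four times as many probes.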


4.2  Dependence  of  Loss  Estimator  Variance  on  Topology 

The variance of $\hat\alpha$ determines the number of probes which must be used in order to obtain an estimate of
a given desired accuracy. Thus it is important to understand how the variance depends on the underlying
topology. Growth of the variance with the size of the tree might preclude application of the estimator to
large internetworks. Long timescale instability has been observed in the Internet [19]; if the timescale
required for accurate measurements approaches that at which variability occurs, the estimator's requirement
of stationarity would be violated. In this section we show that the asymptotic variance $\nu$ of $\hat\alpha$ is independent
of topology for loss ratios approaching zero.

Figure 5: Asymptotic Estimator Variance and Branching Ratio. Depth 2 tree, 2 to 7 leaves. Variance
decreases towards linear approximation $1 - \alpha$ as branching ratio increases.

Figure 6: Asymptotic Estimator Variance and Tree Depth. Binary tree of depth 2, 3 or 4. Variance
increases with tree depth.


The following theorem characterizes the behavior of $\nu$ for small loss ratio, independently of the topology
of the logical tree. Let $\|\bar\alpha\|$ denote the magnitude of the vector of link loss probabilities $\bar\alpha = (\bar\alpha_k)_{k \in U}$.
Set $\delta_{jk} = 1$ if $j = k$ and 0 otherwise.

Theorem 5 $\nu_{jk} = \bar\alpha_k \delta_{jk} + O(\|\bar\alpha\|^2)$ as $\|\bar\alpha\| \to 0$.

Theorem 5 says that the variance of $\hat\alpha$ is, to first order in $\bar\alpha$, independent of topology. However, nothing
is said about higher order dependence, and in particular whether the difference between $\nu_{jk}$ and $\bar\alpha_k \delta_{jk}$
converges to zero uniformly for all topologies as $\bar\alpha \to 0$. We used computer algebra to
calculate the maximum asymptotic variance over links $\max_k \nu_{kk}$ for a selection of trees, as a function of the
uniform Bernoulli probability $\alpha_k = \alpha$. We use the notation $T(r_1, r_2, \ldots, r_n)$ to denote the tree of depth $n + 1$
(depth = maximum level $\ell$ of any leaf) with successive branching ratios $1, r_1, r_2, \ldots, r_n$, i.e. the root node
0 has the single descendant node 1 which has $r_1$ descendants, each of which has $r_2$ descendants, and so on.
We show the dependence on branching ratio in Figure 5 for trees of depth 2. In these examples, increasing
the branching ratio decreases the variance. In Figure 6, we show the dependence on tree depth for binary
trees of depth 2, 3 and 4. In this example, estimator variance increases with tree depth, roughly linearly.
In all examples, estimator variance is approximately linear for $\bar\alpha$ less than about 0.1, and independent of
topology, in keeping with Theorem 5. For larger $\bar\alpha$ it appears from these examples that the change in
estimator variance of moving from simple topologies to more complex ones is governed by two opposing
effects: variance reduction with increasing branching ratio, and variance growth with increasing tree depth.
The reason for this appears to be that increasing the branching ratio increases the size of $R(k)$ (the set of
leaf-nodes descended from $k$) so providing more data points for estimation, while increasing the tree depth
increases cumulative error per link in estimation.
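
As a check of Theorem 5 on the two-leaf tree, the diagonal entries of (23) can be expanded for small loss probabilities. The second form of $\nu_{11}$ below is our own algebraic rearrangement of the $(1,1)$ entry of (23) as we read it, not an expression taken from the paper.

```latex
\nu_{22} = \frac{\alpha_2\bar\alpha_2}{\alpha_1\alpha_3}
         = \bar\alpha_2\,\frac{1-\bar\alpha_2}{(1-\bar\alpha_1)(1-\bar\alpha_3)}
         = \bar\alpha_2 + O(\|\bar\alpha\|^2),
\qquad
\nu_{11} = \frac{\alpha_1\left(\bar\alpha_2\bar\alpha_3 + \bar\alpha_1\alpha_2\alpha_3\right)}{\alpha_2\alpha_3}
         = \alpha_1\bar\alpha_1 + \frac{\alpha_1\bar\alpha_2\bar\alpha_3}{\alpha_2\alpha_3}
         = \bar\alpha_1 + O(\|\bar\alpha\|^2).
```

The entry $\nu_{33}$ follows from $\nu_{22}$ by exchanging the roles of nodes 2 and 3, and each off-diagonal entry carries a factor $\bar\alpha_2\bar\alpha_3$, so it is already of second order.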
5 Data Consistency and Parameter Computation


In this section we address computational issues associated with the estimator $\hat\alpha$. We specify consistency
checks which must be applied to the data before $\hat\alpha$ is computed. We describe an algorithm for computation
of $\hat\alpha$ and discuss its suitability for implementation in a network, in particular the extent to which it is
distributable.

5.1  Data  Consistency 

In this section we describe tests for consistency of the empirical probabilities $\hat\gamma$ with the model. The
validations of the methodology carried out in this paper are all within controlled simulations, so we do not address
here the additional consistency checks which would be required for applications to real network data, such
as tests for stationarity.

The rest of this section focuses on range checking and tree surgery. An arbitrary data set may
not give rise to $\hat\gamma \in \Gamma((0,1)^{\#U})$. If this is because some of the $\hat\gamma_k$ take values 0 or 1, then it can be dealt
with by reducing the tree. In particular, when one of the $\hat\gamma_k$ is 0, not all of the $\alpha_k$ can be inferred from the
data. Those which cannot must be removed from consideration. In other cases, the data is not consistent
with the assumption that loss occurs independently on different links. We discuss these now.

(i) If $\hat\gamma_k = 0$ for any $k \in V$, we construct a new tree by deleting node $k$ and all its descendants, and
perform the analysis on this pruned tree instead. We are unable to distinguish between the various
ways in which $\gamma_k$ may be zero, e.g. $\alpha_k = 0$, or $\alpha_k > 0$ but $\alpha_j = 0$ for children $j \in d(k)$.

(ii) If $\hat\alpha_k = 1$ for any $k \in U$ then we can assign probability 1 to $\alpha_k$. Then, for the purposes of calculation
only, we consider a reduced tree obtained by excising node $k$ in the same manner as nodes with a
single descendant are excised from the physical multicast tree to generate the logical multicast tree;
see Section 2.1.

(iii) Any $\hat\alpha_k > 1$ is a nonphysical value, since the link probabilities are required to lie in $[0,1]$ (subject to (i)
and (ii) above). Theorem 3 tells us this will not occur for sufficiently large $n$. Thus in implementations
of the inference algorithm, this event may be used to trigger the dispatch of further probes.

(iv) The condition $\hat\gamma_k = \sum_{j \in d(k)} \hat\gamma_j$ for some $k \in U \setminus R$ prevents the calculation of $\hat A_k$ and hence also of the
link probabilities for links that include $k$ as a vertex, namely $\hat\alpha_k = \hat A_k / \hat A_{f(k)}$ and $\hat\alpha_j = \hat A_j / \hat A_k$
for $j \in d(k)$. Instead, we estimate only the probabilities $\{\hat\alpha_k \hat\alpha_j : j \in d(k)\}$ on the composite
links from $f(k)$ to the elements of $d(k)$, estimating $\hat\alpha_k \hat\alpha_j = \hat A_j / \hat A_{f(k)}$, $j \in d(k)$. The possibility
$\hat\gamma_k > \sum_{j \in d(k)} \hat\gamma_j$ is precluded by the relations (25) and (26) below. Equality occurs only if the
observed losses satisfy the strong dependence property that each packet reaching a receiver in $R(k)$
reaches no other receiver in $R(k)$.
5.2 Computation of the Estimator on a General Tree

In this section we describe the algorithm for computing $\hat\alpha$ on a general tree. An important feature of the
calculation is that it can be performed recursively on trees. First we show how to calculate the $\hat\gamma_k$. Denote by
$(X_k(i))_{k \in R,\ i = 1, 2, \ldots, n}$ the measured values at the leaf nodes of process $X$ for $n$ probes. Define the binary quantities
$(Y_k(i))_{k \in V,\ i = 1, 2, \ldots, n}$ recursively by

$$Y_k(i) = X_k(i), \quad k \in R \qquad (24)$$

$$Y_k(i) = \bigvee_{j \in d(k)} Y_j(i), \quad k \in V \setminus R \qquad (25)$$

$$\hat\gamma_k = n^{-1} \sum_{i=1}^{n} Y_k(i). \qquad (26)$$
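
Relations (24)-(26) can be sketched in Python as follows. The tree encoding, the function name `empirical_gammas`, and the five-probe receiver traces are hypothetical and serve only to illustrate the bottom-up OR recursion.

```python
def empirical_gammas(children, leaf_traces, root):
    """Return (Y, gamma_hat) from per-probe 0/1 traces observed at the leaves."""
    n = len(next(iter(leaf_traces.values())))
    Y, gamma_hat = {}, {}

    def visit(k):
        kids = children.get(k, [])
        if not kids:
            Y[k] = list(leaf_traces[k])                     # (24)
        else:
            Y[k] = [0] * n
            for j in kids:
                yj = visit(j)
                Y[k] = [a | b for a, b in zip(Y[k], yj)]    # (25): OR over children
        gamma_hat[k] = sum(Y[k]) / n                        # (26)
        return Y[k]

    visit(root)
    return Y, gamma_hat

# Node 1 with receivers 2 and 3; five probes (hypothetical traces).
children = {1: [2, 3]}
leaf_traces = {2: [1, 1, 0, 0, 1], 3: [1, 0, 1, 0, 1]}
Y, gamma_hat = empirical_gammas(children, leaf_traces, root=1)
```

Here `Y[1]` is `[1, 1, 1, 0, 1]`, so `gamma_hat[1]` is 0.8 while `gamma_hat[2]` and `gamma_hat[3]` are both 0.6.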


For simplicity we assume now that $\hat\gamma \in \Gamma((0,1)^{\#U})$, so that, if necessary, steps (i) and (ii) of Section 5.1
have been performed on the data and/or the logical multicast tree in order to bring it to this form. The
calculation of $\hat\alpha$ can be done by another recursion. We formulate both recursions in pseudocode in Figure 7.
The procedure find_gamma calculates the $Y_k$ and $\hat\gamma_k$, assuming $Y_k$ initializes to $X_k$ for $k \in R$ and 0
otherwise. The procedure infer calculates the $\hat\alpha_k$. The procedures could be combined. The full set of link
probabilities is estimated by executing main(1) where node 1 is the single descendant of the root node 0.

Here, an empty product (which occurs when the first argument of infer is a leaf node) is understood to
be zero. We assume the existence of a routine solvefor that returns the value of the first symbolic argument
which solves the equation specified in its second argument. We know from Theorem 1 that under the
conditions for $\hat\gamma$ a unique such value exists.

5.3  Implementation  of  Inference  in  a  Network 

The recursive nature of the algorithm has important consequences for its implementation in a network
setting. Observe that the calculation of $\hat\gamma_k$ and $\hat A_k$ depends on $X$ only through the $(Y_j)_{j \in d(k)}$. Put another
way, if $j$ is a child of $k$, the contribution to the calculation of $\hat\alpha_k$ of all data measured at the set of receivers
$R(j)$ descended from $j$ is summarized through $Y_j$. In a networked implementation this would enable the
calculation to be localized in subtrees at a representative node, the computational effort at each node being
at worst proportional to the depth of the tree (for the node that is the representative for all distinct subtrees
to which it belongs).

Moreover, estimates from measurements at receivers $R(k)$ descended from a node $k$ are consistent with
those from the full set of receivers in the following sense. Executing main($k$) yields the $\hat A_k$ calculated by
main(1) as the value for $\hat\alpha_k$. Thus $\hat A_k$ is the effective probability that a probe traverses a (fictitious) link from the
root 0 directly to $k$. But when the full inference main(1) is performed, it is not hard to see that the $\hat\alpha$ obey
$\hat A_k = \prod_{i=0}^{\ell(k)} \hat\alpha_{f^i(k)}$, i.e. the probability of traversing the path from 0 to $k$ without loss.
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procedure  main  (k)  { 
flnd_gamma (k) ; 
infer  (  fc,  1 ) ; 

} 

procedure  find_gamma  (  )  { 
foreach  ( j  e  d{k)  )  { 

Yj  =  find_gamma  ( j  ) ; 
foreach  {i  e 

Yk[i]=Yk[i]VYj[i]  ; 

} 

} 

Ik  =n~^J2"=iYk[i\  ; 

return  Yk  ; 

} 

procedure  infer  {k,  A) ; 

Ak  =  solvefor(  Ak  %/Ak)  ==  ni6d(fe)(l  “  ); 

ttfe  =AklA  ; 

foreach  ( j  e  d{k)  )  { 
infer  {j  ,Ak)-, 

} 

} 


Figure  7:  PSEUDOCODE  FOR  INFERENCE  OF  LINK  PROBABILITIES 
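
The recursion of Figure 7 can be sketched in Python as follows. This is a minimal illustration and not the Perl implementation used for the experiments: the tree is assumed to be a dictionary `children` mapping each node to its list of child nodes, `Y` is assumed to hold one 0/1 list per node (receiver outcomes at the leaves, zeros at interior nodes), and the routine solvefor is replaced by a simple bisection named `solve_A`, which is our own choice of root finder. All function and variable names are hypothetical.

```python
from math import prod


def find_gamma(children, k, Y, gamma):
    """Fill gamma[k] for the subtree at k; Y[k] becomes the OR of the outcomes below k."""
    for j in children.get(k, []):
        Yj = find_gamma(children, j, Y, gamma)
        Y[k] = [a | b for a, b in zip(Y[k], Yj)]
    gamma[k] = sum(Y[k]) / len(Y[k])
    return Y[k]


def solve_A(gamma_k, child_gammas, iters=200):
    """Solve (1 - gamma_k/A) == prod_j (1 - gamma_j/A) for A by bisection (assumed solver)."""
    if not child_gammas:
        return gamma_k  # leaf: the empty product is read as zero, so A_k = gamma_k

    def h(A):
        return (1 - gamma_k / A) - prod(1 - g / A for g in child_gammas)

    lo, hi = gamma_k, 1.0  # h(lo) <= 0 because gamma_k >= every child gamma
    while h(hi) <= 0:      # widen the bracket until the sign changes
        hi *= 2
        if hi > 1e6:
            raise ValueError("no root found; data violate the assumed conditions")
    for _ in range(iters):
        mid = (lo + hi) / 2
        if h(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def infer(children, k, A_parent, gamma, alpha):
    """Compute alpha[k] = A_k / A_parent and recurse into the children of k."""
    A_k = solve_A(gamma[k], [gamma[j] for j in children.get(k, [])])
    alpha[k] = A_k / A_parent
    for j in children.get(k, []):
        infer(children, j, A_k, gamma, alpha)


def main(children, k, Y):
    """Return the estimated link probabilities for the subtree rooted at node k."""
    gamma, alpha = {}, {}
    find_gamma(children, k, Y, gamma)
    infer(children, k, 1.0, gamma, alpha)
    return alpha
```

For a two-leaf tree with `children = {1: [2, 3]}` and ten probes of which receiver 2 sees probes 1 to 6 and receiver 3 sees probes 3 to 8, the observed frequencies are 0.6, 0.6 and 0.8, and `main(children, 1, Y)` gives approximately 0.9 for link 1 and 2/3 for each leaf link. Note that `find_gamma` overwrites the interior entries of `Y`.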

6  Simulation  Results 

We  evaluated  our  inference  techniques  through  simulation  and  verified  that  they  performed  as  expected. 
This  work  had  two  parts:  model  simulations  and  TCP  simulations.  In  the  model  simulations,  losses  were 
determined by time-invariant Bernoulli processes. These losses follow the model on which we based our
earlier analysis. In the TCP simulations, losses were due to queue overflows as multicast probes competed
with other traffic generated by infinite TCP sources. We used TCP because it is the dominant transport
protocol in the Internet [29]. The following two subsections describe our results from these two simulation
efforts.

6.1  Model  Simulations 

Topology. For the model simulations, we used ad hoc software written in C++. We simulated the two
tree topologies shown in Figures 3 and 4. Node 0 sent a sequence of multicast probes to the leaves. Each
link exhibited packet losses with temporal and spatial independence. We could configure each link with a
different loss probability that held constant for the duration of a simulation run. We fed the losses observed
by the leaves to a separate Perl script that implements the inference calculation described earlier.

Figure 8: CONVERGENCE OF INFERRED LOSS PROBABILITIES TO ACTUAL LOSS PROBABILITIES IN
MODEL SIMULATIONS. Left: Two-leaf tree of Figure 3 with parameters $\bar{\alpha}_1 = 0.02$; $\bar{\alpha}_2 = \bar{\alpha}_3 = 0.05$.
Right: Selected links from four-leaf tree of Figure 4, with parameters $\bar{\alpha}_1 = 0.01$; $\bar{\alpha}_2 = 0.1$; $\bar{\alpha}_3 = \bar{\alpha}_4 =
\bar{\alpha}_5 = \bar{\alpha}_6 = 0.01$; $\bar{\alpha}_7 = 0.5$. The graphs show that inferred probabilities converge to within 0.01 of the
actual probabilities after 2,000 or fewer observations.
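
A model simulation of this kind can be sketched in a few lines of Python. The sketch below is not the C++ and Perl software described above; it assumes the two-leaf tree (link 1 from the root to the branch node, links 2 and 3 to the leaves), draws independent Bernoulli losses per link and per probe, and infers the loss probabilities with the closed form that the equation of Figure 7 reduces to at a node with two children, $A_1 = \gamma_2 \gamma_3 / (\gamma_2 + \gamma_3 - \gamma_1)$. The function names and the seed argument are hypothetical.

```python
import random


def simulate_two_leaf(loss, n, seed=0):
    """Simulate n multicast probes; loss[k] is the loss probability of link k (k = 1, 2, 3).

    Returns two 0/1 lists: whether each probe reached leaf 2 and leaf 3.
    """
    rng = random.Random(seed)
    Y2, Y3 = [], []
    for _ in range(n):
        x1 = rng.random() >= loss[1]              # probe survives the shared link 1
        Y2.append(int(x1 and rng.random() >= loss[2]))
        Y3.append(int(x1 and rng.random() >= loss[3]))
    return Y2, Y3


def infer_two_leaf(Y2, Y3):
    """Infer per-link loss probabilities from the two receiver traces."""
    n = len(Y2)
    g2 = sum(Y2) / n
    g3 = sum(Y3) / n
    both = sum(a & b for a, b in zip(Y2, Y3)) / n  # equals gamma_2 + gamma_3 - gamma_1
    if both == 0:
        raise ValueError("no probe reached both receivers; cannot infer")
    A1 = g2 * g3 / both                            # probability of passing link 1
    return {1: 1 - A1, 2: 1 - g2 / A1, 3: 1 - g3 / A1}
```

With the left-hand parameters of Figure 8 (loss probabilities 0.02, 0.05, 0.05) one would call `infer_two_leaf(*simulate_two_leaf({1: 0.02, 2: 0.05, 3: 0.05}, 20000))` and compare the result with the configured values.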

Convergence.  Figure  8  compares  inferred  packet  loss  probabilities  to  actual  loss  probabilities.  The  left 
graph  shows  results  for  all  three  links  in  our  two-leaf  topology,  while  the  right  graph  shows  results  for 
selected  links  in  the  four-leaf  topology.  In  all  cases,  the  inferred  probabilities  converge  to  within  0.01  of  the 
actual  probabilities  after  2,000  observations. 

Figure 9 compares the empirical and theoretical 95% confidence intervals of the inferred loss probabilities
for the two-leaf topology. The empirical intervals were calculated over 100 simulation runs using
100 different seeds for the random number generator that underlies the Bernoulli processes. The theoretical
intervals are as predicted by (20). As shown, simulation matches theory extremely well; we show the two
graphs separately because the two sets of curves are indistinguishable when plotted together. For 2,000
observations, the confidence intervals lie within 20% of the true probabilities.

It may seem that thousands of probes constitute too many network resources to expend and too long to
wait for a measurement. However, it is important to note that a stream of 200-byte packets every 20 ms
represents only 10 kilobytes per second (80 Kbps), equivalent to a single compressed audio transfer. Furthermore,
a measurement using 5,000 such packets lasts less than two minutes. There already exist a number of MBone "radio"
stations that send long-lived streams of sequenced multicast packets. In some cases we can use these existing
multicast streams as measurement probes without additional cost. Overall, we feel that multicast-based
inference is a practical and robust way to measure network dynamics.

Figure 9: AGREEMENT BETWEEN SIMULATED AND THEORETICAL CONFIDENCE INTERVALS. Left:
Results from 100 model simulations. Right: Predictions from (20). The graphs show two-sided confidence
estimates at 2 standard deviations for link 2 of the four-leaf tree of Figure 4. Parameters were $\bar{\alpha}_1 = 0.01$;
$\bar{\alpha}_2 = 0.1$; $\bar{\alpha}_3 = \bar{\alpha}_4 = \bar{\alpha}_5 = \bar{\alpha}_6 = 0.01$; $\bar{\alpha}_7 = 0.5$. Simulation matches theory extremely well: the two sets
of curves are indistinguishable when plotted in the same graph.

6.2  TCP  Simulations 


Topology. For the TCP simulations, we used the ns network simulator [18]. We configured ns to simulate
the tree topologies shown in Figures 3, 4 and 11. All links had 1.5 Mbps of bandwidth, 10 ms of propagation
delay, and were served by a FIFO queue with a 4-packet limit. Thus, a packet arriving at a link was dropped
when it found four packets already queued at the link.

In each topology, node 0 sent multicast probe packets generated by a source with 200-byte packets and
interpacket times chosen randomly between 2.5 and 7.5 msec. The leaf nodes received the multicast packets
and monitored losses by looking for gaps in the sequence numbers of arriving probes. We fed the losses
observed by the multicast receivers to the same inference implementation used for the model simulations
described above. We also had ns report losses on individual links in order to compare inferred losses with
actual losses.
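
Detecting losses from gaps in the sequence numbers can be illustrated as follows. This is a sketch under stated assumptions rather than the receiver code used in the simulations: probes are assumed to be numbered consecutively from 0, the number of probes sent is assumed to be known to the analyst (losses at the tail of the stream cannot be seen from gaps alone), and both function names are hypothetical.

```python
def loss_indicators(received_seqs, n_sent):
    """Return a 0/1 list with entry i equal to 1 if probe i arrived and 0 if it was lost."""
    seen = set(received_seqs)
    return [1 if i in seen else 0 for i in range(n_sent)]


def gaps(received_seqs):
    """Return (first_missing, last_missing) for every gap between consecutive arrivals."""
    out = []
    prev = None
    for s in sorted(set(received_seqs)):
        if prev is not None and s > prev + 1:
            out.append((prev + 1, s - 1))
        prev = s
    return out
```

The 0/1 list produced by `loss_indicators` has the form assumed for the receiver outcomes in the inference calculation, one list per leaf.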

In the two- and four-receiver topologies, each node maintained TCP connections to its child nodes.
These connections used the Tahoe variant of TCP, sent 1,000-byte packets, and were driven by an infinite
data source. Links to left children carried one such TCP stream, while links to right children carried two
TCP streams. The link between nodes 0 and 1 also carried one TCP stream.

In the eight-receiver topology, the traffic was more diverse, with 52 TCP connections between different
pairs of nodes, giving rise to approximately 8 connections per link on average.

Convergence. Figure 10 compares inferred loss rates to actual loss rates on selected links of our two- and
four-leaf topologies. As shown, the inferred rates closely track the actual rates over 10,000 observations.
Figure 11 compares inferred and actual loss rates in the eight-receiver topology with diverse background
traffic; in this case the tracking is even closer.

Figure 10: TRACKING OF ACTUAL LOSS RATES BY INFERRED LOSS RATES IN TCP SIMULATIONS.
Left: Two-leaf tree of Figure 3. Right: Selected links from four-leaf tree of Figure 4 (some pairs of
probabilities are offset for clarity). The graphs show that the inferred loss rates closely track the actual loss
rates over 10,000 observations.

Figure 11: TRACKING OF ACTUAL LOSS RATES BY INFERRED LOSS RATES IN TCP SIMULATIONS
WITH DIVERSE BACKGROUND TRAFFIC. Left: Eight-leaf binary tree. Right: Close tracking of actual
loss rates by estimated loss rates as number of observations is increased up to 1,000.

We  note  that  the  inferred  values  are  accurate  even  though  queue  overflows  due  to  TCP  interference 
do  not  obey  our  temporal  independence  assumption.  TCP  is  a  bursty  packet  source,  particularly  in  the 
region  of  exponential  window  growth  during  a  slow  start  [9].  In  our  simulations,  multicast  probes  are  often 
lost  in  groups  as  they  compete  for  queue  space  with  TCP  bursts.  This  phenomenon  is  readily  apparent 
when  watching  animations  of  our  simulations  with  the  nam  tool  [17].  Inspection  of  the  autocorrelation 
function  of  the  time  series  of  packet  losses  for  a  series  of  experiments  predominantly  showed  correlation 
indistinguishable from zero beyond a lag of 1 (i.e. greater than back-to-back losses). As we explain in more
detail in Section 8, the estimator $\hat{\alpha}$ is still asymptotically accurate for large numbers of probes when losses
have temporal dependence of sufficiently short range. However, the rate of convergence of the estimates to
their true values will be slower.

Figure 12: ACCURACY OF INFERENCE IN TCP SIMULATIONS. Left: Two-leaf tree of Figure 3. Right:
Four-leaf tree of Figure 4. The graphs show normalized root mean square differences between actual and
inferred loss rates, computed across 100 simulations. After an initial transient, inferred loss rates settle down
to within 8 to 15% (in the two-leaf tree) and 4 to 18% (in the four-leaf tree) of actual loss rates, depending
on the link. The RMS error was reduced to approximately 1% by modifying the MLEs to correct for spatial
loss dependence.
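
The autocorrelation check mentioned above can be sketched as follows. The estimator shown is the usual biased sample autocorrelation of a 0/1 loss series; it is our assumption about the form of the calculation, since the text does not specify which estimator was used, and the function name is hypothetical.

```python
def autocorrelation(x, max_lag):
    """Sample autocorrelation of the series x at lags 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        raise ValueError("constant series: autocorrelation is undefined")
    return [
        sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / (n * var)
        for lag in range(max_lag + 1)
    ]
```

Applied to a per-receiver loss indicator series, values near zero at lags greater than 1 would correspond to the behavior reported for the simulations.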

Figure  12  shows  the  Root  Mean  Square  (RMS)  differences  between  the  inferred  and  actual  loss  rates 
in  the  two-  and  four-leaf  topologies.  These  differences  were  calculated  over  100  simulation  runs  using  100 
different  seeds  for  the  random  number  generator  that  governs  the  time  between  probe  packets.  As  shown, 
the  differences  can  drop  significantly  during  the  first  2,000  observations.  However,  at  some  point  they  level 
off  and  do  not  drop  much  further,  if  at  all.  This  persistence  reveals  a  systematic,  although  small,  error  in  the 
inferred  values  because  of  spatial  loss  dependence.  In  our  simulations,  the  same  multicast  probe  is  lost  on 
sibling  links  more  often  than  the  spatial  independence  assumption  dictates.  These  dependent  losses  lead  the 
inference  calculation  to  underestimate  losses  on  the  sibling  links  and  to  overestimate  losses  on  the  parent 
link. 

We can quantify the spatial loss dependence present in the simulations. We can also calculate the effect
of such dependence on the inferred loss probabilities by extending our previous analysis. Thus a prior
estimate of the degree of dependence could be used to obtain corrections to the Bernoulli inference. We
discuss this in more detail for spatial dependence in Section 7 and give an example of how to apply the
correction. Applied to the inferences on the two-leaf tree summarized in Figure 10, the corrections reduce an RMS
error of between 8 and 15% to one of around 1%. The key observation behind these analyses is that the
error in the inferred values varies smoothly with the degree of spatial dependence. The greater the dependence
in the network, the larger the error. We can arrange for correlated losses in a simulated network,
for example by creating synchronized interference streams on sibling links. However, the results for the
eight-receiver  topology  with  diverse  background  traffic  support  our  belief  that  large  and  long-lasting  spatial 
loss  dependence  is  unlikely  in  real  networks  like  the  Internet  because  of  their  traffic  and  link  diversity. 

7  The  Analysis  and  Correction  of  Spatial  Dependence 

7.1  Analysis  of  Spatial  Dependence 

When spatial dependence is present in packet losses, the Bernoulli model assumption is violated. But even
with  such  dependence,  we  can  still  ask  what  are  the  marginal  loss  probabilities  for  each  link  separately. 
In  this  section  we  quantify  the  effects  of  this  dependence  and  show  how  they  may  be  corrected  for  on  the 
basis  of  a  priori  knowledge  of  them.  We  propose  that  this  knowledge  should  be  obtained  by  independent 
measurements  on  instrumented  networks.  Moreover,  we  establish  that  dependence  deforms  the  Bernoulli 
estimates  continuously  in  the  sense  that  small  divergences  from  independence  of  the  losses  lead  to  small 
divergence  of  the  estimates  of  the  marginal  loss  probabilities  from  their  true  values.  For  binary  trees  we  find 
that  the  effect  of  such  dependence  on  the  estimates  of  marginal  loss  probabilities  for  links  in  the  interior 
of the network is second order, and becomes negligible in regions of the network across which loss and
dependence  change  little. 

One motivation for considering dependent losses comes from the well-known example of synchronization
between TCP flows, which can occur as a result of slow-start after packet loss; see [9]. Flows
which have experienced common loss on a link $k$ will then have some degree of dependence. Viewed as
background traffic against which the probe packets compete, they can be expected to give rise to dependent
losses of probe packets on links in the subtree descended from $k$. However, the dependence of probe loss
can be expected to decrease on progressing down the tree from $k$. This happens if we assume that flows
which became dependent through losses at a given node $k$ typically have a spread of destination addresses; then
their paths through the network will subsequently diverge. The fraction of the total traffic that they contribute
on links descended from $k$ will then decrease on progressing down the tree from $k$; hence the dependent influence
of such flows on probe loss will decrease likewise.

The foregoing discussion motivates us to capture such dependence to first order by considering, within
the class of dependent loss processes, those for which dependence only occurs between losses on sibling
links, i.e., between those $X_j$ and $X_{j'}$ for which $f(j) = f(j')$. Let $\Lambda = \{\{j_1, \ldots, j_n\} \subset d(k),\ k \in V \setminus R\}$
denote the set of subsets of sibling links. We characterize the joint distribution of the $(X_k)_{k \in V}$ through the
family of joint conditional probabilities where, for $k = f(j_1) = \ldots = f(j_n)$,

    $\alpha_{j_1,\ldots,j_n} := P[X_{j_1} = \cdots = X_{j_n} = 1 \mid X_k = 1]$    (27)

(For Bernoulli loss, $\alpha_{j_1,\ldots,j_n} = \alpha_{j_1} \cdots \alpha_{j_n}$.) We now derive analogous relations to (6) in this case. It is
convenient to work initially with the quantities

    $\tilde{\beta}_k := P[\Omega(k) \mid X_k = 1] = P[\Omega(k) \mid X_{f(k)} = 1] / P[X_k = 1 \mid X_{f(k)} = 1] = \beta_k / \alpha_k$    (28)

For $n \le \#d(k)$ let $d_n(k)$ denote the set of subsets of $d(k)$ of cardinality $n$. By the Inclusion-Exclusion
Principle (see e.g. Chapter 5.2 of [25])

    $P[\Omega(k)] = P[\cup_{j \in d(k)} \Omega(j)] = \sum_{n=1}^{\#d(k)} (-1)^{n-1} \sum_{\{j_1,\ldots,j_n\} \in d_n(k)} P[\Omega(j_1) \cap \ldots \cap \Omega(j_n)]$,    (29)

from which we find using (27) and (28) that

    $\tilde{\beta}_k = \sum_{n=1}^{\#d(k)} (-1)^{n-1} \sum_{\{j_1,\ldots,j_n\} \in d_n(k)} \alpha_{j_1,\ldots,j_n} \tilde{\beta}_{j_1} \cdots \tilde{\beta}_{j_n}$    (30)

Reexpressed in terms of the $\gamma_k$ we obtain the following analog of (10) for $k \in U \setminus R$:

    $H_k(A_k, \gamma, \psi) := \gamma_k / A_k - \sum_{n=1}^{\#d(k)} (-1)^{n-1} \sum_{\{j_1,\ldots,j_n\} \in d_n(k)} \psi_{j_1,\ldots,j_n} \prod_{i=1}^{n} \gamma_{j_i} / A_k = 0$    (31)

where $\psi_{j_1,\ldots,j_n} = \alpha_{j_1,\ldots,j_n} / (\alpha_{j_1} \cdots \alpha_{j_n})$ and we write $\psi = (\psi_{j_1,\ldots,j_n})_{\{j_1,\ldots,j_n\} \in \Lambda}$. For a given loss model
one can in principle compute $\psi$ and compute $A_k$ from (31). Rather than do this, however, we establish some
structural results.

We can compare the actual values $A_k(\psi)$ which solve (31) for $A_k$, with those obtained from (10) with
the Bernoulli assumption, which we can write as $A_k(1)$. The following theorem shows that the deformation
from $A_k(\psi)$ to $A_k(1)$ is continuous in the neighborhood of the Bernoulli values $\psi = 1$ (i.e. $\psi_{j_1,\ldots,j_n} = 1$
for all $\{j_1, \ldots, j_n\} \in \Lambda$).

Theorem 6 Let $\alpha_k > 0$. There exists a neighborhood of $\psi = 1$ on which $\psi \mapsto A_k(\psi)$ is continuous.

Proof of Theorem 6: The result follows from the Implicit Function Theorem (see [26]) provided that
$\partial_{A_k} H_k(A_k(1), \gamma, 1) \neq 0$. But $H_k(A_k, \gamma, 1) = H_k(A_k, \gamma) = h(\gamma_k / A_k, \{\gamma_j / \gamma_k : j \in d(k)\})$ appearing
in (10) and Lemma 1, and so the result follows from $\partial_x h(x(c), c) < 0$ as established during the proof of
Lemma 1. ∎

7.2  Spatially  Dependent  Losses  in  Binary  Trees 

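When $T$ is a binary tree we can obtain explicit results. For $k \in U \setminus R$ write $\psi_k = \psi_{j,j'}$, where $d(k) = \{j, j'\}$.
Then from (31) we have

    $A_k = \begin{cases} \gamma_k, & k \in R \\ \dfrac{\psi_k \gamma_j \gamma_{j'}}{\gamma_j + \gamma_{j'} - \gamma_k}, & k \in U \setminus R \end{cases}$    (32)

Let $\alpha(\psi)$ be the true value of $\alpha$, i.e. that obtained by combining (32) with (11). $\alpha(1)$ is then the value
previously obtained using the Bernoulli assumption. Let $k = 1$ denote the single descendant of the root
node 0.

Theorem 7 Let $T$ be a binary tree.

(i) There is a bijection $\Gamma_\psi$ from $\mathcal{A}$ to $\mathcal{G}$ such that $\Gamma_\psi^{-1}(\gamma) = \alpha(\psi)$, with $\Gamma_1 = \Gamma$ from Theorem 1.

(ii)

    $\alpha_k(\psi) = \begin{cases} \psi_k \alpha_k(1), & k = 1 \\ \alpha_k(1) / \psi_{f(k)}, & k \in R \\ \alpha_k(1)\, \psi_k / \psi_{f(k)}, & \text{otherwise} \end{cases}$    (33)

Proof of Theorem 7: From (32), $A_k(\psi) = \psi_k \gamma_j \gamma_{j'} / (\gamma_j + \gamma_{j'} - \gamma_k) = \psi_k A_k(1)$ for $k \in U \setminus R$. The form of (ii)
then follows from (11); this is used as the definition of $\Gamma_\psi^{-1}$ for (i). ∎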

Theorem 7(ii) has the interesting interpretation that in the interior of the network (i.e. except for node 1
and the leaf-nodes) the error in using $\alpha_k(1)$ in place of $\alpha_k(\psi)$ is a second order effect. For the error depends
only on the relative magnitude of correlations at adjacent nodes through the quotient $\psi_k / \psi_{f(k)}$. If
the link probabilities and dependencies are (approximately) equal at each node of the tree, then this quotient
will be (approximately) one, and so the Bernoulli estimate $\hat{\alpha}_k(1) := \Gamma_k^{-1}(\hat{\gamma})$ will be (approximately) equal
to $\Gamma_{\psi,k}^{-1}(\hat{\gamma})$, for interior $k$. Thus we see that the presence of dependent losses in binary trees perturbs the
Bernoulli-based estimator little for links within the interior of regions across which the degree of dependence
is similar. On the other hand, at the boundaries between such regions, a priori knowledge of the degree of
dependence can help make the estimates more accurate. This motivates future work both in simulation
studies and instrumentation of heterogeneous networks in order to establish how the degree of dependence is
influenced by dynamic factors such as utilization, and (comparatively) static factors such as link technology
and relative link speeds.

It is interesting to see that the TCP simulations of the 4-leaf tree display some of the features one might
expect from the above discussion. Observe in the right-hand side of Figure 10 that for the leaf-links (6 and 7) the
inferred loss rate underestimates the actual loss rate, while for link 1 it overestimates it. For the interior link
3, the inferred and actual values are almost identical. This is consistent with the above discussion if $\psi_k > 1$
and $\psi_3 \approx \psi_{f(3)} = \psi_1$. Note that for $d(k) = \{j, j'\}$,

    $\psi_k > 1 \iff P[X_j = X_{j'} = 1 \mid X_k = 1] > P[X_j = 1 \mid X_k = 1]\, P[X_{j'} = 1 \mid X_k = 1]$.    (34)

In other words, $\psi_k > 1$ iff $X_j$ and $X_{j'}$ are (conditional on $X_k = 1$) positively correlated. We expect this to
be the case when synchronized losses occur as described at the start of this section.




                RMS difference from actual loss
                adjusted        original
    link 1      [illegible]     0.142
    link 2      [missing]       0.114
    link 3      0.007           0.089

Table 1: CORRECTING FOR SPATIAL DEPENDENCE: RMS proportional difference of inferred from actual
losses in ns simulation of two-leaf tree in Figure 3, after 10,000 probes. Adjustment of inference to account
for dependence (left column) shows order of magnitude improvement over original inference (right column).


7.3  Correction  for  Spatial  Dependence  in  Binary  Trees 


If some knowledge of the degree of dependence in the traffic is available, then this can be used to adjust
the inferred loss probabilities accordingly. This motivates experimental studies of real networks with instrumented
links in order to ascertain the magnitude of the dependence. We intend to undertake these experiments
in the future. Here we show how knowledge of dependence can be used to correct the Bernoulli-based
estimates of link probabilities for non-interior nodes. We consider the pair of leaf-nodes $\{j, j'\} = d(k)$. Let
$Y_j$ have the distribution of $X_j$ conditioned on $X_k = 1$. Suppose we know a priori an estimate $\hat{\kappa}$ for the
correlation $\kappa$ of $Y_j$ and $Y_{j'}$. Now the theoretical value of the correlation is

    $\kappa = \dfrac{\mathrm{Cov}(Y_j, Y_{j'})}{\sqrt{\mathrm{Var}(Y_j)\,\mathrm{Var}(Y_{j'})}} = \dfrac{\alpha_{jj'} - \alpha_j \alpha_{j'}}{\sqrt{\alpha_j \bar{\alpha}_j \alpha_{j'} \bar{\alpha}_{j'}}} = (\psi_k - 1) \sqrt{\dfrac{\alpha_j \alpha_{j'}}{\bar{\alpha}_j \bar{\alpha}_{j'}}}$    (35)

Thus we expect to improve our estimates $\hat{\alpha}_j(1)$ by using $\hat{\alpha}_j(\hat{\psi})$ instead, where $\hat{\psi}$ is obtained from (35)
by using $\hat{\kappa}$ and $\hat{\alpha}(1)$ in place of $\kappa$ and $\alpha$.

To  test  this  approach,  we  measured  the  loss  dependence  in  an  ns  simulation  of  10,000  probes  in  the 
two-leaf  tree,  then  conducted  100  further  ns  simulations  of  10,000  probes,  and  adjusted  the  inferred  link 
probabilities  in  this  manner.  Comparing  the  actual,  adjusted,  and  originally  inferred  loss  ratios  we  see  this 
provides  improvement:  the  root  mean  square  error  goes  down  from  between  8  and  15%  (depending  on  the 
link)  to  about  1%  in  this  case;  see  Table  1. 
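
The adjustment can be sketched for the two-leaf tree as follows. The sketch rests on two assumptions about the formulas of this section: that (35) reads $\kappa = (\psi_k - 1)\sqrt{\alpha_j \alpha_{j'} / (\bar{\alpha}_j \bar{\alpha}_{j'})}$, and that (33) gives $\alpha_1(\psi) = \psi\,\alpha_1(1)$ for the link below the root and $\alpha_j(\psi) = \alpha_j(1)/\psi$ for the leaf links. The function names and the dictionary layout are hypothetical, and this is not the code used to produce Table 1.

```python
from math import sqrt


def psi_from_correlation(kappa, a_j, a_jp):
    """Invert the assumed form of (35) to obtain psi from a correlation estimate kappa."""
    return 1.0 + kappa * sqrt((1 - a_j) * (1 - a_jp) / (a_j * a_jp))


def adjust_two_leaf(alpha_bern, kappa):
    """Adjust Bernoulli estimates {1: a1, 2: a2, 3: a3} (link 1 is the parent of 2 and 3).

    Returns (psi_hat, adjusted estimates). The Bernoulli estimates of the leaf links
    stand in for the true values inside (35), as proposed in the text.
    """
    psi = psi_from_correlation(kappa, alpha_bern[2], alpha_bern[3])
    adjusted = {
        1: alpha_bern[1] * psi,   # link below the root: multiply by psi
        2: alpha_bern[2] / psi,   # leaf links: divide by psi
        3: alpha_bern[3] / psi,
    }
    return psi, adjusted
```

Because the Bernoulli estimates rather than the true probabilities enter the right-hand side of (35), a single application moves the estimates toward the true values without recovering them exactly.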


8  Temporal  Dependence  and  Convergence  Rates 

8.1  Ergodicity  and  Asymptotic  Accuracy 

In this section we investigate the impact of temporal dependence on the estimator $\hat{\alpha}$. Denote by $X(n) =
(X_k(n))_{k \in V}$ the (spatial) process of the $n$-th probe. The first observation is that, if we replace the assumption
of independence between probes by merely assuming that the (temporal) process $(X(n))_{n \in \mathbb{N}}$ is stationary
and ergodic, then $\hat{\alpha}$ still converges to $\alpha$ almost surely as the number of observations grows to $\infty$. This is
because, by definition, the observed probabilities $\hat{\gamma}$ of the ergodic process converge almost surely to the long
term averages. By stationarity, these are just the $\gamma = \Gamma(\alpha)$ as before, where the $\alpha$ are the (time)-marginal
distributions of the link probabilities. A simple argument involving the Inverse Function Theorem (e.g.,
see [26]) shows that $\Gamma^{-1}$ is continuous on $\Gamma((0,1)^{\#U})$, and hence $\hat{\alpha} \to \alpha$ almost surely. Note we do not
rely on $\hat{\alpha}$ being the maximum likelihood estimator, with respect to some parameter space, for the marginal
probabilities $\alpha$ of the general process. Rather, we have shown that the Bernoulli estimator is asymptotically
accurate for stationary ergodic processes.

In the remainder of this section we examine the rate of convergence when $X$ possesses temporal dependence.
In an application of the method to measurement on real networks, however, inherent variability (due
to large scale events such as routing changes) may impose limits on the durations over which we can expect
the loss process to be stationary. For this reason it is important to understand in more detail the impact of
time-dependent packet loss on convergence rates. We propose to examine this through models. Markovian
models of packet loss have been proposed on the basis of observations of the Internet (e.g., see [1]), although
some longer bursts of losses were also found. We shall see that the price of temporal dependence is slower
convergence than for the Bernoulli case. One can understand this qualitatively from the fact that burstiness
in the packet loss processes means that $\hat{\gamma}$ takes longer to approach its long-term average.

8.2  Convergence  Rates  for  Markovian  Congestion 

The main tool in understanding convergence rates is the following. Let $\Gamma_k^{-1}$ denote the node $k$ component
of $\Gamma^{-1}$, so that $\alpha_k = \Gamma_k^{-1}(\gamma)$. Suppose now that the random variables $\hat{\gamma}$ are asymptotically Gaussian as
$n \to \infty$ with

    $\sqrt{n}\,(\hat{\gamma} - \gamma) \xrightarrow{D} \mathcal{N}(0, \sigma)$,    (36)

where $\sigma_{jk} = \lim_{n \to \infty} n\,\mathrm{Cov}(\hat{\gamma}_j, \hat{\gamma}_k)$, for $j, k \in U$. Here $\xrightarrow{D}$ denotes convergence in distribution. Then by
the Delta method (see Chapter 7 of [27]), since $\Gamma_k^{-1}$ is continuously differentiable on $\mathcal{G}$ (see Theorem 1),
$\Gamma_k^{-1}(\hat{\gamma})$ is also asymptotically Gaussian:

    $\sqrt{n}\,(\hat{\alpha}_k - \alpha_k) \xrightarrow{D} \mathcal{N}(0, \nu_k)$, where $\nu_k = \nabla\Gamma_k^{-1}(\gamma) \cdot \sigma \cdot \nabla\Gamma_k^{-1}(\gamma)$.    (37)

In the remainder of this section we establish (36) within the context of Markov loss processes, and perform
some explicit calculations for the 2-leaf tree.

We expand the class of loss processes as follows. We will define a Markov process $(Y(n))_{n \in \mathbb{N}}$, where
$Y(n)$ will describe the state of the network encountered by the $n$-th probe; this description is used whether,
for example, the interprobe times are constant, variable or random. $Y$ is constructed as follows. For each
$k \in U$ let $(Y_k(n))_{n \in \mathbb{N}}$ be an independent Markov process on the state space $\{0, 1\}$. We think of $Y_k(n)$ as
representing the state of link $k$ at time $n$, taking the value 0 if the link is congested, 1 if it is not. A probe
that encounters a congested link is lost. We represent this by the process $X = (X_k(n))_{k \in U, n \in \mathbb{N}}$ defined by
letting $X_k(n)$ be conditionally independent of the other $(X_j(m), Y_j(m))$, given $(X_{f(k)}(n), Y_k(n))$, with

    $P[X_k(n) = 1 \mid X_{f(k)}(n), Y_k(n)] = \begin{cases} 1, & X_{f(k)}(n) = Y_k(n) = 1 \\ 0, & \text{otherwise} \end{cases}$    (38)

When $Y_k(\cdot)$ is Bernoulli with probability $\alpha_k$ to be in the state 1, then the $X(n)$ are independent for each $n$,
with the $X_k(n)$ distributed as described in Section 2.2. $X$ is not a Markov process, but rather is a function
of the Markov process $Y$. Moreover, $X(n)$ is a function of $Y(n)$ alone, which we denote by $\chi$. For
each $k \in U$, let $\Phi(k)$ be the set of configurations $y$ of $Y$ such that $\chi(y)$ has outcome $\chi(y)^{(R)}$ in $\Omega(k)$, i.e.,

    $\Phi(k) = \{y \in \{0,1\}^{\#U} : \chi(y)^{(R)} \in \Omega(k)\}$.    (39)

Let $Q$ denote the transition matrix for $Y$, i.e., $Q = \otimes_{k \in U} Q(k)$ is the Kronecker product of the transition
matrices of the individual $Y_k$. Let $q(k) = (1 - \alpha_k, \alpha_k)$ and let $q = \otimes_{k \in U} q(k)$ be the corresponding product
distribution.


Theorem 8 With the above notation, assume $\alpha_k \in (0, 1)$ for all $k \in U$. Then (37) holds with

    $\sigma_{jk} = \sum_{y \in \Phi(j)} \sum_{z \in \Phi(k)} \Big( q_y (\delta_{yz} - q_z) + 2 \sum_{m=1}^{\infty} q_y (Q^m_{yz} - q_z) \Big)$,    (40)

where $Q^m$ denotes the $m$-step transition matrix.

Observe that in the Bernoulli case, the second term in (40) vanishes, while the first depends only on the
marginal probabilities $\alpha$. This means that the first term in (40) gives rise to the diagonal elements of (23);
in what follows we can thus restrict our attention to the increase in the asymptotic variance as specified by
the second term.

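We parameterize the transition matrix of $Y_k$ as

    $Q(k) = \begin{pmatrix} 1 - \alpha_k \bar{\omega}_k & \alpha_k \bar{\omega}_k \\ \bar{\alpha}_k \bar{\omega}_k & 1 - \bar{\alpha}_k \bar{\omega}_k \end{pmatrix}$,    (41)

where $\bar{\omega}_k = 1 - \omega_k \in (0, 1/\max\{\alpha_k, \bar{\alpha}_k\}]$. $\omega_k$ parameterizes the burstiness of $Y_k$ without changing its marginal
probabilities. $Y_k(m)$ and $Y_k(m+1)$ are positively (or negatively) correlated when $\omega_k > 0$ (or $\omega_k <
0$). When $\omega_k = 0$, $Y_k$ is Bernoulli. By calculation of the matrix powers of $Q(k)$ through its spectral
decomposition, we find that $q_y(k)\, Q^n(k)_{yz}$ is given by the matrix

    $[q_y(k)\, Q^n(k)_{yz}] = \omega_k^n F(k) + G(k)$, where $F(k) = \alpha_k \bar{\alpha}_k \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$, $G(k) = q(k) \otimes q(k)$.    (42)

Expanding $Q^n = \otimes_{k \in U} Q^n(k)$ and summing over $n$ we find

    $\sum_{m=1}^{\infty} [q_y (Q^m_{yz} - q_z)] = \sum_{W \subset U} g(W)\, \big(\otimes_{k \in W} F(k)\big) \otimes \big(\otimes_{k \in U \setminus W} G(k)\big)$,    (43)

where $g(\emptyset) = 0$ and otherwise $g(W) = \big(\prod_{k \in W} \omega_k\big) / \big(1 - \prod_{k \in W} \omega_k\big)$.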
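
The single-link relation (42) can be checked numerically with a short pure-Python sketch. It assumes the reading of (41) in which the two-state chain on {0, 1} has stationary law $(1 - \alpha, \alpha)$ and second eigenvalue $\omega$, so that the off-diagonal entries are $\alpha(1 - \omega)$ and $(1 - \alpha)(1 - \omega)$; the function names are hypothetical.

```python
def transition_matrix(alpha, omega):
    """Two-state chain with stationary law (1 - alpha, alpha) and second eigenvalue omega."""
    w = 1.0 - omega
    return [[1 - alpha * w, alpha * w],
            [(1 - alpha) * w, 1 - (1 - alpha) * w]]


def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def matpow(Q, n):
    """n-th power of a 2x2 matrix by repeated multiplication (n >= 0)."""
    P = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        P = matmul(P, Q)
    return P


def max_error_42(alpha, omega, n):
    """Largest entrywise gap between q_y Q^n_yz and omega^n F + G."""
    q = [1 - alpha, alpha]
    Qn = matpow(transition_matrix(alpha, omega), n)
    c = alpha * (1 - alpha)
    F = [[c, -c], [-c, c]]
    G = [[q[y] * q[z] for z in range(2)] for y in range(2)]
    return max(
        abs(q[y] * Qn[y][z] - (omega ** n * F[y][z] + G[y][z]))
        for y in range(2) for z in range(2)
    )
```

For example, `max_error_42(0.9, 0.5, 5)` should be zero up to floating-point rounding, and setting `omega = 0` makes every row of the transition matrix equal to the stationary law, which is the Bernoulli case.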
8.3  Example:  the  Two-leaf  Tree

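Taking gradients in (14)-(16) and reexpressing them in terms of $\alpha$ we find

    $\nabla\Gamma_1^{-1}(\Gamma(\alpha)) = \dfrac{(1, -\bar{\alpha}_3, -\bar{\alpha}_2)}{\alpha_2 \alpha_3}$,
    $\nabla\Gamma_2^{-1}(\Gamma(\alpha)) = \dfrac{(-1, 1, \bar{\alpha}_2)}{\alpha_1 \alpha_3}$,
    $\nabla\Gamma_3^{-1}(\Gamma(\alpha)) = \dfrac{(-1, \bar{\alpha}_3, 1)}{\alpha_1 \alpha_2}$.    (44)

Using the notation $(abc)$, with $a, b, c \in \{0, 1\}$, to denote a value of $Y(n)$, we have from (13):

    $\Phi(1) = \{(111), (110), (101)\}$, $\Phi(2) = \{(111), (110)\}$, $\Phi(3) = \{(111), (101)\}$.    (45)

For simplicity we set the $\alpha_k$ and $\omega_k$ equal to $\alpha$, $\omega$. Then (43) becomes

    $\sum_{m=1}^{\infty} [q_y (Q^m_{yz} - q_z)] = \dfrac{\omega^3}{1 - \omega^3}\, F(1) \otimes F(2) \otimes F(3)$
        $+ \dfrac{\omega^2}{1 - \omega^2}\, \big(F(1) \otimes F(2) \otimes G(3) + F(1) \otimes G(2) \otimes F(3) + G(1) \otimes F(2) \otimes F(3)\big)$
        $+ \dfrac{\omega}{1 - \omega}\, \big(F(1) \otimes G(2) \otimes G(3) + G(1) \otimes G(2) \otimes F(3) + G(1) \otimes F(2) \otimes G(3)\big)$.    (46)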


Combining (44), (45) and (46) in (37) with Theorem 8 gives the Bernoulli variances $\mathcal{I}^{-1}_{11}$ (47) and
$\mathcal{I}^{-1}_{22} = \mathcal{I}^{-1}_{33}$ (48), together with the asymptotic variances under Markovian losses, $\nu_1$ (49) and
$\nu_2 = \nu_3$ (50), each expressed as the corresponding Bernoulli variance plus a term depending on $\alpha$ and $\omega$.

[Equations (47)-(50): expressions not recoverable from the source text.]


From (42), $\omega$ is the geometric decay rate of correlations. We can interpret $\tau = 1/(1 - \omega)$ as the mean
correlation time of the losses; $\tau = 1$ for Bernoulli losses. In Figure 13 we display the increase in asymptotic
variance by plotting the ratio $\nu_1 / \mathcal{I}^{-1}_{11}$ of the asymptotic variance with Markovian correlations to that without.
We do this for $\alpha \in [0.5, 1]$ and $\tau \in [1, 10]$. $\nu_2 / \mathcal{I}^{-1}_{22}$ displayed very similar behavior. The ratio is increasing
in the correlation time $\tau$, and in the link transmission probability $\alpha$.
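
As a numerical companion to (37) in the Bernoulli case ($\omega = 0$), the following sketch evaluates $\nu_k = \nabla\Gamma_k^{-1} \cdot \sigma \cdot \nabla\Gamma_k^{-1}$ for the two-leaf tree. It makes two assumptions that are ours rather than statements of the text: the gradients are those obtained by differentiating $\alpha_1 = \gamma_2 \gamma_3 / (\gamma_2 + \gamma_3 - \gamma_1)$, $\alpha_2 = (\gamma_2 + \gamma_3 - \gamma_1)/\gamma_3$ and $\alpha_3 = (\gamma_2 + \gamma_3 - \gamma_1)/\gamma_2$, and for independent probes $\sigma_{jk} = P[\Omega(j) \cap \Omega(k)] - \gamma_j \gamma_k$. It does not reproduce the Markovian expressions (49)-(50). Function names are hypothetical.

```python
def bernoulli_sigma(a1, a2, a3):
    """Covariance matrix of the indicators of Omega(1), Omega(2), Omega(3) for one probe."""
    g = [a1 * (1 - (1 - a2) * (1 - a3)), a1 * a2, a1 * a3]
    both = a1 * a2 * a3                      # probe reaches both leaves
    joint = [[g[0], g[1], g[2]],             # Omega(2) and Omega(3) are subsets of Omega(1)
             [g[1], g[1], both],
             [g[2], both, g[2]]]
    return [[joint[j][k] - g[j] * g[k] for k in range(3)] for j in range(3)]


def gradients(a1, a2, a3):
    """Rows are the gradients of alpha_1, alpha_2, alpha_3 with respect to (gamma_1, gamma_2, gamma_3)."""
    return [[1 / (a2 * a3), -(1 - a3) / (a2 * a3), -(1 - a2) / (a2 * a3)],
            [-1 / (a1 * a3), 1 / (a1 * a3), (1 - a2) / (a1 * a3)],
            [-1 / (a1 * a2), (1 - a3) / (a1 * a2), 1 / (a1 * a2)]]


def asymptotic_variances(a1, a2, a3):
    """Delta-method variances nu_1, nu_2, nu_3 for independent (Bernoulli) probes."""
    s = bernoulli_sigma(a1, a2, a3)
    G = gradients(a1, a2, a3)
    return [sum(G[k][i] * s[i][j] * G[k][j] for i in range(3) for j in range(3))
            for k in range(3)]
```

With all three link probabilities equal to 0.9, the sketch gives approximately 0.101 for link 1 and 0.111 for each leaf link, so roughly ten thousand independent probes would be needed for a standard error near 0.003.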


8.4  Temporal  Dependence  and  Probing  Methodology 

An approach to avoiding the effect of temporal dependence would be to time probes at intervals larger than
the typical correlation time of losses. Although this will reduce the number of probes required for a given
level of convergence, the absolute time of convergence may increase due to the increased time between
probes. Increasing the probe spacing by a factor $\tau'$, but with all probes lying within a given measurement
period, would increase the variance of the estimates by a factor $\tau'$ for independent losses. With Markovian
losses, the effect of dependence between probes could be ameliorated by taking $\tau' > \tau$, the correlation
time. But for the two-leaf tree we see from (47) that when $\alpha \to 1$, then $\nu_k / \mathcal{I}^{-1}_{kk} \to 1/(1 - \omega) = \tau$ for
$k = 1, 2, 3$. Thus for small loss probabilities, the slow-down in the rate of convergence of $\hat{\alpha}$ is no worse than
that obtained by spacing probes to be approximately independent. In this example then, one may as well use
all probes irrespective of their mutual dependence, rather than try to space them out to avoid dependence.

Figure 13: IMPACT OF TEMPORAL DEPENDENCE ON CONVERGENCE OF ESTIMATES: The ratio
of the asymptotic variances of $\hat{\alpha}_1$ with and without temporal dependence. Ratio is increasing in correlation
time $\tau$, and in link transmission probability $\alpha$.

We envisage that direct measurement of the correlation time of received probes could be used, in combination
with calculations of the previous section, to determine the number of probes, in an ongoing measurement,
that are required in order to infer the link probabilities for a given accuracy. In the example considered
we have seen that in order to estimate the increase in the asymptotic variance due to dependence between
losses of small probability, it is sufficient to determine the correlation time of observed losses. When losses
are heterogeneous, this will be conservative, since the autocorrelation will be dominated by the component
with slowest decay.

A  related  issue  is  the  randomization  of  interprobe  times  in  order  to  avoid  bias  in  the  selection  of  network 
states  which  are  observed  via  the  probes.  Probes  with  exponentially  distributed  spacings  will  see  time 
averages;  this  is  the  PASTA  property  (Poisson  Arrivals  See  Time  Averages;  see  e.g.  [32]).  This  approach 
has  been  proposed  for  network  measurements  [23]  and  is  under  consideration  in  the  IP  Performance  Metrics 
working  group  of  the  IETF  [8].  In  the  context  of  the  above  discussion,  lengthening  the  interprobe  time  is  to 
be  understood  as  increasing  the  mean  of  the  exponential  distribution. 
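
A minimal sketch of such a probe schedule follows. It draws exponentially distributed interprobe times with a chosen mean, so that lengthening the interprobe time means raising `mean_gap`. The function name, its parameters and the use of seconds as the time unit are illustrative assumptions.

```python
import random


def poisson_probe_times(mean_gap, duration, seed=None):
    """Send times in [0, duration) with exponential gaps of mean `mean_gap`."""
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        # expovariate takes the rate, i.e. 1 / mean of the gap distribution.
        t += rng.expovariate(1.0 / mean_gap)
        if t >= duration:
            return times
        times.append(t)
```

Passing a fixed `seed` makes a schedule reproducible, which is convenient when comparing inference runs.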

9  Summary  and  Future  Work 

In  this  paper,  we  introduced  the  use  of  end-to-end  measurements  of  multicast  traffic  to  infer  network-internal 
characteristics.  We  developed  statistically  rigorous  techniques  for  estimating  packet  loss  rates  on  internal 
links, and validated these techniques through simulation. We showed that the inferred values quickly
converged to within a small error of the actual values. We also presented evidence that our techniques yield
accurate  results  even  in  the  presence  of  moderate  levels  of  temporal  and  spatial  loss  dependence. 

We are extending our work in several directions. First, we are applying multicast-based inference to
metrics  other  than  packet  loss.  In  particular,  we  have  developed  estimators  for  link  delay.  We  are  also 
investigating  ways  to  infer  link  bandwidth  and  network  topology  using  multicast  probes.  The  ability  to 
determine  topology  would  free  our  measurements  from  the  assumption  of  a  priori  knowledge  of  topology  or 
of  a  separate  topology-discovery  tool. 

Second, we plan to do more extensive simulations. We plan to substitute RED queueing for FIFO queueing
to study the effect of RED on loss dependence. We also plan to substitute Poisson probes for CBR probes
to avoid inadvertent synchronization of the probe traffic with periodic network processes. At the same time,
we plan to simulate more complex topologies than the simple examples used throughout this paper. Topologies
other than complete binary trees would stress our MLE for general trees, while larger topologies would
test the convergence properties of our techniques on larger problem instances. This will be complemented
by a theoretical analysis of the dependence of convergence rates on topology. Furthermore, we would like to
explore how closely loss rates experienced by our probes agree with loss rates experienced by other network
applications and protocols, for example TCP. We expect that our multicast-based measurements will yield
ambient loss rates that are meaningful in a broad context.

Third,  we  plan  to  experiment  with  multicast-based  inference  on  the  Internet.  As  a  preliminary  step,  we 
plan  to  measure  ambient  dependence  in  the  real  network,  and  determine  the  extent  to  which  we  need  to  adapt 
our  estimates  to  their  presence.  We  also  plan  to  deploy  our  inference  tools  in  multicast-enabled  portions  of 
the  Internet,  including  the  MBone,  to  test  our  techniques  on  a  real  network. 

Finally,  we  would  like  to  integrate  our  inference  tools  with  one  or  more  of  the  large-scale  measurement 
infrastructures  under  construction.  NIMI  seems  particularly  suited  because  of  its  intended  role  as  a  general 
framework where many types of measurement can be carried out. The challenge will be to adapt a unicast-based
infrastructure to perform multicast-based measurements, and in particular to schedule measurements,
collect  results,  and  perform  inference  calculations  when  large  numbers  of  receivers  are  involved. 

In conclusion, we feel that multicast-based inference is a powerful approach to measuring Internet dynamics.
The rigorous statistical analysis behind our techniques gives them a firm theoretical footing, while
the bandwidth efficiency of multicast traffic gives them much desired scalability. Robust and efficient
measurements are increasingly important as the Internet continues to grow in size and diversity.

10  Proofs  of  Theorems 

Proof of Lemma 1: Let $h_1(x) = 1 - x$ and $h_2(x, c) = h_2(x) = \prod_i (1 - c_i x)$. Let $q_i = c_i/(1 - c_i x)$.
Then for $x \in [0, 1]$, $h_1''(x) = 0$ and $h_2''(x) = h_2(x)\big((\sum_i q_i)^2 - \sum_i q_i^2\big) > 0$. Hence $h(x) = h_1(x) - h_2(x)$ is
strictly concave on $[0, 1]$. Now $h(0) = 0$, $h(1) < 0$ and $h'(0) = -1 + \sum_i c_i > 0$. So since $h$ is concave
and continuous on $[0, 1]$ there must be exactly one solution to $h(x) = 0$ for $x \in (0, 1)$. Now write
$h(x, c) = h_1(x) - h_2(x, c)$. Let $x(c)$ be the unique solution to $h(x(c), c) = 0$. The above derivation implies
that $h'(x(c)) = (\partial h(x, c)/\partial x)|_{x = x(c)} < 0$, so in particular it is different from 0. Since $h$ is continuously
differentiable, then by the Implicit Function Theorem [26], so is $c \mapsto x(c)$. ∎
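
The uniqueness argument can be illustrated numerically. The sketch below finds the root of $h(x) = (1 - x) - \prod_i (1 - c_i x)$ in $(0, 1)$ by bisection; it assumes $0 < c_i < 1$ and $\sum_i c_i > 1$, which are the conditions implied by $h(1) < 0$ and $h'(0) > 0$ in the proof. The function names are illustrative.

```python
from math import prod


def h(x, c):
    """h(x) = (1 - x) - prod_i (1 - c_i x)."""
    return (1.0 - x) - prod(1.0 - ci * x for ci in c)


def unique_root(c, tol=1e-12):
    """Bisection for the single zero of h in (0, 1)."""
    if not all(0.0 < ci < 1.0 for ci in c) or sum(c) <= 1.0:
        raise ValueError("need 0 < c_i < 1 and sum(c) > 1")
    # h is concave with h(0) = 0 and h'(0) > 0, so h > 0 just right of 0.
    lo = 0.5
    while h(lo, c) <= 0.0:
        lo /= 2.0
        if lo < 1e-15:
            raise ArithmeticError("could not bracket the root")
    hi = 1.0  # h(1) = -prod(1 - c_i) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if h(mid, c) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For `c = [0.8, 0.8]` the equation reduces to $0.6x = 0.64x^2$, whose nonzero solution is $x = 0.9375$.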


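Proof of Theorem 2: The idea is to split up the sum (2) into portions on which $\partial \log p(x)/\partial \alpha_k$ is constant. These
will be $\Omega(k)$, the $\Omega(f^i(k)) \setminus \Omega(f^{i-1}(k))$ for $i = 1, 2, \ldots, \ell(k)$, and $\Omega(0)^c$.

Consider first the case that $x \in \Omega(k)$. Then $\alpha_k$ occurs in $p(x)$ as a factor, and hence $\partial \log p(x)/\partial \alpha_k = 1/\alpha_k$.
When $x \in \Omega(f^i(k)) \setminus \Omega(f^{i-1}(k))$ for $i = 1, 2, \ldots, \ell(k)$, then $p(x) = \beta_{f^{i-1}(k)} R_k(x)$ where $R_k(x)$ does not
depend on $\alpha_k$ (or indeed on any $\alpha_j$ for $j \preceq k$). Hence for $x \in \Omega(f^i(k)) \setminus \Omega(f^{i-1}(k))$,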

dlogp{x)  _  1 


Similarly,  when  x  E  12(0)'^ 

On  combining  these: 
For the derivatives, some algebra with (7) shows that

The right hand term in equation (55) follows by iterating the middle term. Observe that

n(x)  ^  ^  n(x) 


E 


T/fifc)  -  7/»-i(fc) 


xeaU'{k))\kiU'-Hk)) 

Combining  (53),  (54),  (55)  and  (56)  we  get 
8  =  1  (^f'~^{k)  m  =  l  l^f^-^(k)

Here  we  adopt  the  convention  that  the  empty  product  for  i  =  1  means  1,  and  that  the  symbol  7/(o)  that 
occurs  when  i  =  1  £{k)  means  1. 


Set  for  all  k  eV.  For  A;  =  0,  (57)  yields  0  =  70  —  /3o(l  —  To) / /3o>  whence 


To  =  /3o  =  To- 


(58) 
For any other $k$, combining (57) for $k$ and $j = f(k)$ yields


1  ^fk 

whence  —  = 


Ij 


Pj 


7k 

7j' 


(59) 


Together with (58) this gives $\gamma_k = \hat\gamma_k$ for all $k \in V$.


Proof of Theorem 3: (i) By the strong law of large numbers, $\hat\gamma \to \Gamma(\alpha)$, $P_\alpha$ almost surely, as $n \to \infty$.
Since $\Gamma$ is, in particular, bijective, the model is identifiable, since $\Gamma(\alpha) = \Gamma(\alpha')$ implies $\alpha = \alpha'$.

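(ii) Convergence of $\hat\gamma$ to $\gamma$ (from (i)) and continuity of $\Gamma^{-1}$ (from Theorem 1) yield convergence of
$\hat\alpha = \Gamma^{-1}(\hat\gamma)$ to $\alpha = \Gamma^{-1}(\gamma)$ as $n \to \infty$. We now establish convergence of the MLE. Fix some $\alpha^0 \in (0, 1)^{\#U}$,
$M \subset (0, 1)^{\#U}$, $x \in \Omega$ and define

$$Z(M, x) = \inf_{\alpha' \in M} \log \frac{p(x; \alpha^0)}{p(x; \alpha')} = \log p(x; \alpha^0) - \sup_{\alpha' \in M} \log p(x; \alpha'). \qquad (60)$$

Observe that $p(x; \alpha)$ is polynomial in the $\alpha_k$, and hence continuous. According to Lemma 7.54 in [27], it
suffices to show that, for each $\alpha' \neq \alpha^0$, there is an open set $N_{\alpha'}$ containing $\alpha'$ such that
$E_{\alpha^0} Z(N_{\alpha'}, X) > -\infty$. (Here $E_{\alpha^0}$ is the expectation w.r.t. $P_{\alpha^0}$.)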

Look at the two terms in $E_{\alpha^0} Z(M, X)$ for any $M \subset (0, 1)^{\#U}$. The first is $E_{\alpha^0} \log p(X; \alpha^0) =
\sum_{x \in \Omega} p(x; \alpha^0) \log p(x; \alpha^0)$. This is finite since $p \log p$ is bounded for $p \in [0, 1]$ and $\Omega$ is finite. For the second
term, note that $p(x; \alpha') \le 1 \Rightarrow \log p(x; \alpha') \le 0 \Rightarrow \sup_{\alpha' \in M} \log p(x; \alpha') \le 0 \Rightarrow
-\sup_{\alpha' \in M} \log p(x; \alpha') \ge 0 \Rightarrow E_{\alpha^0} Z(M, X) \ge E_{\alpha^0} \log p(X; \alpha^0) > -\infty$. Finally, we note that although it
is not mentioned there, Lemma 7.54 in [27] requires identifiability, which we proved in (i) above.

(iii) Now let $\alpha \in (0, 1)^{\#U}$ be the true set of link probabilities. From part (ii), with $P_\alpha$ probability 1,
the MLE converges to $\alpha$ as $n \to \infty$. Hence, for each sequence of probes we have that for $n$ sufficiently large, the MLE
lies in the interior of $(0, 1)^{\#U}$. For such $n$, the MLE must then solve the likelihood equation (12). We know from
Theorem 2 that solutions of the likelihood equation are unique, and hence the MLE coincides with $\hat\alpha$. ∎


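Proof of Theorem 4: (ii) Recall $V(k) = \{j \in V : j \preceq k\}$, $R(k) = V(k) \cap R$ and $U = V \setminus \{0\}$. Set
$S(\alpha) = (S_k(\alpha))_{k \in U}$ with $S_k(\alpha) = \frac{\partial \mathcal{L}}{\partial \alpha_k}(\alpha)$ (the score vector). Then $I_{jk}(\alpha) = \mathrm{Cov}(S_j(\alpha), S_k(\alpha)) =
E_\alpha(S_j(\alpha) S_k(\alpha))$ since $E_\alpha(S_k(\alpha)) = \sum_{x \in \Omega} \partial p(x; \alpha)/\partial \alpha_k = 0$.

Suppose that $I(\alpha)$ is singular for some $\alpha = (\alpha_k)_{k \in U} \in (0, 1)^{\#U}$. Then there exists some nonzero
vector $c = (c_k)_{k \in U}$ for which $c \cdot I \cdot c = 0$. But $c \cdot I \cdot c$ is the variance of the mean-zero random variable
$c \cdot S(\alpha)$, so then we would have that $c \cdot S(\alpha) = 0$, $P_\alpha$ almost surely, or equivalently

$$\sum_{k \in U} c_k \frac{\partial \log p(x, \alpha)}{\partial \alpha_k} = 0 \quad \forall x \in \Omega \qquad (61)$$

since $P_\alpha(\{x\}) > 0$ for all $x \in \Omega$. We show that, in fact, (61) implies $c_k = 0$, first for $k \in R$, then for all
$k \in U$.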
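
Let $x^{(1)} \in \Omega$ be such that $x^{(1)}_j = 1$ for all $j \in R$, and for some $k \in R$ let $x^{(2)}_j = 1$ for $j \neq k$ and 0 for
$j = k$. Then

$$p(x^{(1)}, \alpha) = \prod_{j \in U} \alpha_j \quad \text{while} \quad p(x^{(2)}, \alpha) = \bar\alpha_k \prod_{j \in U \setminus \{k\}} \alpha_j \qquad (62)$$

and so from (61)

$$\sum_{j \in U} \frac{c_j}{\alpha_j} = 0 \quad \text{while} \quad -\frac{c_k}{\bar\alpha_k} + \sum_{j \in U \setminus \{k\}} \frac{c_j}{\alpha_j} = 0. \qquad (63)$$

Combining the last two equations we find $c_k = 0$.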

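We now proceed by induction. For $k \in U$ assume that $c_j = 0$ for all $j \prec k$. We now prove that $c_k = 0$.
Let $x^{(1)}$ be as before, and set

$$x^{(3)}_i = \begin{cases} 1 & i \in R \setminus R(k) \\ 0 & i \in R(k). \end{cases} \qquad (64)$$

Then

$$p(x^{(3)}, \alpha) = (\bar\alpha_k + \alpha_k \phi_k) \prod_{j \in V \setminus V(k)} \alpha_j \qquad (65)$$

where $\phi_k = \prod_{j \in d(k)} \beta_j = P_\alpha[X_j = 0 \ \forall j \in R(k) \mid X_k = 1]$. Hence from (61)

$$\frac{c_k(\phi_k - 1)}{\bar\alpha_k + \phi_k \alpha_k} + \sum_{j \in V \setminus V(k)} \frac{c_j}{\alpha_j} = 0, \qquad (66)$$

recalling the assumption that $c_j = 0$ for all $j \prec k$. For the same reason (61), applied to $x^{(1)}$, reads

$$\frac{c_k}{\alpha_k} + \sum_{j \in V \setminus V(k)} \frac{c_j}{\alpha_j} = 0. \qquad (67)$$

Combining (66) and (67), we find $c_k = 0$. The equality of $\nu$ with $I^{-1}$ in the interior of the space of
parameters $\alpha$ is standard under the conditions established during the proof of Theorem 3; see, e.g., Chapter
6.4 of [11].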

(iii) We refer to Theorem 7.63 of [27]. Clearly $\mathcal{L}$ is 3-times continuously differentiable on $(0, 1)^{\#U}$,
and has bounded expectation in some neighborhood of $\alpha$. This establishes the relation (7.64) in [27].

$I(\alpha)$ is clearly finite on $(0, 1)^{\#U}$. Hence $I$ is finite in $(0, 1)^{\#U}$, so together with Theorem 3
and the non-singularity of $I$ established in (ii) above, we are able to conclude the result. ∎


Proof of Theorem 5: Let $j \vee k$ denote the nearest common ancestor of $j$ and $k$, i.e. $j \vee k$ is the $\preceq$-least
common upper bound of $j$ and $k$. The proof proceeds by a number of subsidiary results. Since probes are
assumed independent, it suffices to evaluate all random quantities for $n = 1$ probes.

(i) As $\|\bar\alpha\| \to 0$,

$$\text{(a) } 1 - A_k = s(k) + O(\|\bar\alpha\|^2); \quad \text{(b) } \beta_k = O(\|\bar\alpha\|); \quad \text{(c) } 1 - \gamma_k = s(k) + O(\|\bar\alpha\|^2), \qquad (68)$$

where

$$s(k) = \sum_{j \succeq k} \bar\alpha_j. \qquad (69)$$

The relation (a) is clear by expanding $A_k = \prod_{j \succeq k} (1 - \bar\alpha_j)$. (b) follows by an inductive argument. Observe
from (6) that if (b) holds for all $k \in d(j)$, it also holds for $j$. But since $\beta_k = \bar\alpha_k$ for leaf-nodes $k \in R$, (b)
holds for all $k$. (c) then follows from the relation $\gamma_k = A_k(1 - \prod_{j \in d(k)} \beta_j)$.
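(ii) As $\|\bar\alpha\| \to 0$,

$$\mathrm{Cov}(\hat\gamma_j, \hat\gamma_k) = s(j \vee k) + O(\|\bar\alpha\|^2). \qquad (70)$$

To see this, we write $\mathrm{Cov}(\hat\gamma_j, \hat\gamma_k) = E[\hat\gamma_j \hat\gamma_k] - E[\hat\gamma_j] E[\hat\gamma_k]$, and $E[\hat\gamma_j] = \gamma_j$ by definition. If $k$ is an ancestor
of $j$ then $\hat\gamma_j = 1 \Rightarrow \hat\gamma_k = 1$ and so $E[\hat\gamma_j \hat\gamma_k] = \gamma_j$. Similarly, if $j$ is an ancestor of $k$, then $E[\hat\gamma_j \hat\gamma_k] = \gamma_k$.
Otherwise $\hat\gamma_j = 1, \hat\gamma_k = 1 \Rightarrow \hat\gamma_{j \vee k} = 1$, and so we write $E[\hat\gamma_j \hat\gamma_k] = P[\hat\gamma_j = 1 \mid X_{j \vee k} = 1] \, P[\hat\gamma_k = 1 \mid
X_{j \vee k} = 1] \, P[X_{j \vee k} = 1] = P[\hat\gamma_j = 1] \, P[\hat\gamma_k = 1] / P[X_{j \vee k} = 1] = \gamma_j \gamma_k / A_{j \vee k}$. Thus,

$$\mathrm{Cov}(\hat\gamma_j, \hat\gamma_k) = \begin{cases} \gamma_k(1 - \gamma_j) & j \succeq k \\ \gamma_j(1 - \gamma_k) & k \succeq j \\ \gamma_j \gamma_k (1/A_{j \vee k} - 1) & \text{otherwise.} \end{cases}$$

(70) then follows from (68) and the fact that $j \vee k = j$ when $j \succeq k$.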
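
The three covariance cases can be checked by brute-force enumeration on a small tree. The sketch below uses a hypothetical tree with root 0, one internal node 1 and two leaves 2 and 3, with link $k$ ending at node $k$; the tree shape, the transmission probabilities in `ALPHA` and all function names are assumptions made only for illustration.

```python
from itertools import product

PARENT = {1: 0, 2: 1, 3: 1}          # hypothetical tree: 0 -> 1 -> {2, 3}
LEAVES = [2, 3]
ALPHA = {1: 0.9, 2: 0.8, 3: 0.7}     # assumed link transmission probabilities


def path_to_root(node):
    """Nodes from `node` up to, but excluding, the root 0."""
    path = []
    while node != 0:
        path.append(node)
        node = PARENT[node]
    return path


def indicator(k, state):
    """1 if some receiver at or below k got the probe, for one link state."""
    for leaf in LEAVES:
        path = path_to_root(leaf)
        if k in path and all(state[n] for n in path):
            return 1
    return 0


def exact_moments():
    """Means gamma_k and covariances of the single-probe indicators."""
    nodes = sorted(PARENT)
    mean = {k: 0.0 for k in nodes}
    cross = {(j, k): 0.0 for j in nodes for k in nodes}
    for bits in product((0, 1), repeat=len(nodes)):
        state = dict(zip(nodes, bits))
        prob = 1.0
        for n in nodes:
            prob *= ALPHA[n] if state[n] else 1.0 - ALPHA[n]
        g = {k: indicator(k, state) for k in nodes}
        for j in nodes:
            mean[j] += prob * g[j]
            for k in nodes:
                cross[(j, k)] += prob * g[j] * g[k]
    cov = {jk: cross[jk] - mean[jk[0]] * mean[jk[1]] for jk in cross}
    return mean, cov


gamma, cov = exact_moments()
A1 = ALPHA[1]  # A_1: probability that a probe reaches node 1
```

Here `cov[(2, 3)]` can be compared with `gamma[2] * gamma[3] * (1 / A1 - 1)` (the "otherwise" case, since $2 \vee 3 = 1$), and `cov[(1, 2)]` with `gamma[2] * (1 - gamma[1])` (the case $j \succeq k$).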

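(iii) As $\|\bar\alpha\| \to 0$,

$$D(\alpha) = D + O(\|\bar\alpha\|) \quad \text{where} \quad D_{jk} := \begin{cases} 1 & k = j \\ -1 & k = f(j) \\ 0 & \text{otherwise.} \end{cases}$$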


To  establish  this,  note  first  that  i7  (a)  has  inverse!}  ^(a)  whose  elements  are  {D{a) 
djijdaj  =  Ji/aj  whenj  A  i.  When  j  A  i,  then  from  the  proof  of  Theorem  2 


dj^  ^  A-  — 

duj  *  duj 


=  A, 


-  n 


n 


Pk 


m  =  l 


ked(Srn(j))\fm-^A 


(71) 


(72) 


d^i  /  da^ .  Now 


(73) 


From  (68)  (b),  this  goes  to  0  as  ||a||  — >  0.  Finally,  for  all  other  j,  7^  does  not  depend  on  ai,  and  so  the 
derivative  is  0.  Summarizing,  as  ||a||  — >  0, 

C(«)-'=C  +  0(||«||)  where  A,  :=  {  J  (74) 

Since  matrix  inversion  is  continuous  in  an  open  neighborhood  of  the  non-singular  matrices,  then  (72)  follows 
if  we  can  show  that  Dij  and  Dij  are  inverses.  First  D^iDij  =  —  Df(k)j  =  required.  Second 

DijDjk  =  Dik  —  ^j^Ak)  The  second  term  is  only  potentially  non-zero  when  k  y  i.\n  this  case 
the  only  term  that  contributes  to  the  sum  is  when  i  A  i,  giving  —1.  Hence  Di^D^  =  ^ik  as  required, 
(iv) By (iii), and continuity of finite dimensional matrix products, we have as $\|\bar\alpha\| \to 0$ that

$$\nu_{ik} = \sum_{i', k'} D_{ii'} \, s(i' \vee k') \, D_{kk'} + O(\|\bar\alpha\|^2). \qquad (75)$$

It remains to evaluate

$$\sum_{i', k'} D_{ii'} \, s(i' \vee k') \, D_{kk'} = s(i \vee k) - s(i \vee f(k)) - s(f(i) \vee k) + s(f(i) \vee f(k)). \qquad (76)$$

When $i = k$, then $i \vee k = i$, $i \vee f(k) = f(i) \vee k = f(i) \vee f(k) = f(i)$ and so (76) yields $s(i) - s(f(i)) = \bar\alpha_i$.
All other possible $i$ and $k$ yield zero, as we now show. If $i \prec k$ then $i \vee k = f(i) \vee k = k$, while
$i \vee f(k) = f(i) \vee f(k) = f(k)$, and hence (76) is zero. The case $k \prec i$ is similar. In all other cases
$i, k \prec i \vee k$ and so $i \vee k = i \vee f(k) = f(i) \vee k = f(i) \vee f(k)$. ∎
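
The case analysis for (76) can be confirmed mechanically on a small example. The sketch below evaluates the four-term right-hand side of (76) for every pair of non-root nodes of a hypothetical five-link tree; the tree, the loss probabilities in `LOSS`, the convention $s(0) = 0$ for the root, and the function names are all assumptions for illustration.

```python
PARENT = {1: 0, 2: 1, 3: 1, 4: 3, 5: 3}                 # hypothetical tree
LOSS = {1: 0.01, 2: 0.02, 3: 0.03, 4: 0.04, 5: 0.05}    # assumed loss probs


def ancestors(k):
    """Chain k, f(k), f(f(k)), ..., ending at the root 0."""
    chain = [k]
    while chain[-1] != 0:
        chain.append(PARENT[chain[-1]])
    return chain


def join(i, k):
    """Nearest common ancestor of i and k (a node is its own ancestor)."""
    above_k = set(ancestors(k))
    for a in ancestors(i):
        if a in above_k:
            return a


def s(k):
    """Sum of loss probabilities on the path from the root to k; s(0) = 0."""
    return sum(LOSS[j] for j in ancestors(k) if j != 0)


def rhs_76(i, k):
    """s(i v k) - s(i v f(k)) - s(f(i) v k) + s(f(i) v f(k))."""
    fi, fk = PARENT[i], PARENT[k]
    return s(join(i, k)) - s(join(i, fk)) - s(join(fi, k)) + s(join(fi, fk))
```

Evaluating `rhs_76(i, k)` over all pairs should give `LOSS[i]` on the diagonal and zero elsewhere, in line with the argument above.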

Proof  of  Theorem  8:  Since  G  (0, 1),  each  Tfc(-)  is  irreducihle,  and  hence  so  is  Y (•),  and  so  q  is 
the  unique  stationary  distribution  for  Q,  i.e.  ^^^QyzQz  =  %■  For  n  probes,  %  where 

qy  =  n~^  '^ra  ^yY(m)-  By  the  Central  Limit  Theorem  for  Markov  processes,  see  e.g.  Chapter  17  of  [15],  q 
is  asymptotically  Gaussian  as  ra  — >  oo  with 

A7C(0,0  (77) 

where 

n  n 

^  =  lim  nCo\j{qy,q^)  =  lim  V  V  Co\j{5  Y(;m)i^zY(:m')) 

n— >-oo  n— >-oo  \  \  > 

m=l  m'  =  l 

oo 

=  QyiSyz  —  Qz)  ^  'y  ^  {Qyz  ~  Qy)Qz- 

m=l 


(78) 

(79) 

ABSTRACT

In this paper we consider the problem of inferring link-level loss
rates from end-to-end multicast measurements taken from a
collection of trees. We give conditions under which loss rates are
identifiable on a specified set of links. Two algorithms are
presented to perform the link-level inferences for those links on which
losses can be identified. One, the minimum variance weighted
average (MVWA) algorithm, treats the trees separately and then
averages the results. The second, based on expectation-maximization
(EM), merges all of the measurements into one computation.
Simulations show that EM is slightly more accurate than MVWA, most
likely due to its more efficient use of the measurements. We also
describe extensions to the inference of link-level delay, inference
from end-to-end unicast measurements, and inference when some
measurements are missing.

1.  INTRODUCTION 

As the Internet grows in size and diversity, its internal behavior
becomes ever more difficult to characterize. Any one organization has
administrative access to only a small fraction of the network's
internal nodes, whereas commercial factors often prevent organizations
from sharing internal performance data. Thus it is important to
characterize internal performance from end-to-end measurements.

One promising technology that avoids these problems uses end-to-end
multicast measurements from a single tree to infer link-level
loss rates and delay statistics [1] by exploiting the inherent
correlation in performance observed by multicast receivers. A
shortcoming of this technology is that it is usually impossible to include
all links of interest in any one tree. Consider the network in
Figure 1(a) as an example. In this network, end-hosts 0 and 1 are
sources, end-hosts 4 and 5 are receivers, and the set of links of
interest is {(2, 5), (3, 2)}. Both tree 1 and tree 2
are needed to cover the set of links of interest, as illustrated in
Figures 1(b) and 1(c). Therefore, in order to characterize the behavior
of a network (or even a portion of it), it is necessary to perform
measurements on multiple trees. Inferring link-level performance
from measurements taken from several trees poses a challenging
problem that is the focus of this paper.

*This work was supported in part by DARPA under contracts
F30602-00-2-0554 and F30602-98-2-0238, and by the National
Science Foundation under Grant EIA-0080119.


Figure 1: A single tree cannot characterize a network. Panels: (a) Network, (b) Tree 1, (c) Tree 2.
Legend: internal router, end-host (source), end-host (receiver).

In this paper we address the following two problems. First, given a
collection of multicast trees, can we infer the performance of all of
the links (or a specified subset) that are contained by the trees?
Second, when the performance of the links of interest can be
identified, how do we obtain accurate estimates of their performance?
Focusing on loss rate as the performance metric, we introduce and
evaluate two algorithms. The first, the minimum variance weighted
average (MVWA) algorithm, performs inference on each tree
separately and, for each link, returns a weighted average of the
estimates taken from the different trees. This procedure may not
always be able to infer the behavior of links whose loss rates are,
nevertheless, identifiable. The loss rates for these links are
obtained as a solution to a set of linear equations involving the
inferred loss rates from individual trees. The second algorithm, the
expectation-maximization (EM) algorithm, on the other hand,
applies the standard expectation-maximization technique [15] to the
measurement data taken from all of the trees. It returns estimates
of the loss rates of all identifiable links. We evaluate the two
algorithms through simulation, studying their convergence rates and
relative performance. We find that EM estimates are at least as
accurate as those produced by MVWA. The improvement is more
pronounced when either the number of measurements is small or the
distribution of measurements among the various trees is skewed.

Although the focus here is on link-level loss rates, we give
extensions to EM to handle link delay. In addition, we show how MVWA
and EM can be applied when end-to-end multicast measurements
are not available, or when some measurements are missing.

There  is  a  related  problem  of  how  to  choose  the  set  of  trees  so  as 
to  cover  all  of  the  links  in  the  network  (or  subset  of  interest)  in  an 
efficient manner. This question has been dealt with elsewhere [2]
and  is  not  considered  here.  We  take  as  given  the  set  of  trees  and 
observations  from  which  we  are  to  draw  inferences. 

Network tomography from end-to-end measurements has received
considerable attention recently. In the context of multicast
probing, the focus has been on loss, delay, and topology
identification. Extensions to unicast probing can be found in [6, 7, 8, 11,
13]. However, these have treated only individual trees. There are
techniques for round trip metrics such as loss rate and delay [14],
based on measurements taken from a single node. Last, linear
algebraic methods have been proposed for estimating link-level average
round trip delays [19] and one-way delays [12]. Neither of these
extends to other metrics. Furthermore, the latter only yields biased
estimates of average delays.

The remainder of the paper is organized as follows. Section 2
presents the model for a "multicast forest" (set of multicast trees).
In Section 3 we present necessary and sufficient conditions for
when the loss probabilities can be inferred from end-to-end
multicast measurements. The MVWA and EM algorithms are presented
in Section 4 along with convergence properties of the latter.
Section 5 presents the results of simulation experiments. Extensions to
delay inference, the use of unicast, and missing data are found in
Section 6. Last, Section 8 concludes the paper.

2.  NETWORK  AND  LOSS  MODEL 

Let $N = (V(N), E(N))$ denote a network with sets of nodes
$V(N)$ and links $E(N)$. Here $(i, j) \in E(N)$ denotes a directed
link from node $i$ to node $j$ in the network. Let $\Psi$ denote a set of
multicast trees embedded in $N$, i.e., $\forall T \in \Psi$, $V(T) \subseteq V(N)$ and
$E(T) \subseteq E(N)$. We denote $\bigcup_{T \in \Psi} V(T)$ by $V(\Psi)$ and $\bigcup_{T \in \Psi} E(T)$
by $E(\Psi)$. Note that $(i, j) \in E(\Psi)$ can appear in more than one
tree. For $(i, j) \in E(N)$, we denote by $\Psi_{i,j} \subseteq \Psi$ the set of trees
which include link $(i, j)$. Consider a tree $T \in \Psi$. Each node $i$ in
$T$, apart from the root $\rho(T)$, has a parent in $T$, $f(i, T)$, such that
$(f(i, T), i) \in E(T)$. The set of children of $i$ in tree $T$ is denoted
by $d(i, T)$. Let $\tau_{i,T}$ denote the subtree of $T$ rooted at node $i$. Let
$R(\tau_{i,T})$ denote the receivers in subtree $\tau_{i,T}$. We denote the path
from node $i$ to $j$, $i, j \in V(T)$, in tree $T$ by $P_T(i, j)$. Define a
segment in $T$ to be a path between either the root and the closest
branch point, two neighboring branch points, or a branch point and
a leaf. We represent a segment by the set of links that comprises it.

For $T \in \Psi$, we identify the root $\rho(T)$ with the source of probes,
and the set of leaves $R(T)$ with the set of receivers. For a tree $T$,
a probe is sent down the tree starting at the root. If it reaches node
$j \in V(T)$, a copy of the probe is produced and sent down the tree
toward each child of $j$. As a packet traverses link $(i, j)$, it is lost
with probability $1 - \alpha_{i,j}$ and arrives at $j$ with probability $\alpha_{i,j}$. We
denote $1 - \alpha_{i,j}$ by $\bar\alpha_{i,j}$. Let $\alpha = (\alpha_{i,j})_{(i,j) \in E(\Psi)}$. We assume
losses of the same probe on different links and of different probes
on the same link are independent, and that losses of probes sent
from the different sources $\rho(T)$, $T \in \Psi$, are independent.

We describe the passage of probes down each tree $T$ by a stochastic
process $X_T = (X_{k,T})_{k \in V(T)}$ where $X_{k,T} = 1$ if the probe
reaches node $k$, and 0 if it does not. By definition $X_{\rho(T),T} = 1$. If
$X_{i,T} = 0$ then $X_{j,T} = 0$ for all $j \in d(i, T)$. If $X_{i,T} = 1$ then for
$j \in d(i, T)$, $X_{j,T} = 1$ with probability $\alpha_{i,j}$ and $X_{j,T} = 0$ with
probability $\bar\alpha_{i,j}$. We assume that the collection of trees is in
canonical form, namely that $0 < \alpha_{i,j} < 1$, $\forall (i, j) \in E(\Psi)$. An arbitrary
collection of trees can be transformed into one with canonical form.

In an experiment, a set of probes is sent from the multicast tree
sources $\rho(T)$, $T \in \Psi$. For each $T \in \Psi$, we can think of each
probe as a trial, the outcome of which is a record of whether or
not the probe was received at each receiver in $R(T)$. In terms of
the random process $X_T$, the outcome is a configuration $X_{R(T)} =
(X_{i,T})_{i \in R(T)}$ of zeros and ones at the receivers. Notice that only
the values of $X_T$ at the receivers are observable; the values at
the internal nodes are unknown. Each outcome is thus an
element of the space $\Omega_{R(T)} = \{0, 1\}^{\#R(T)}$. For a given set of link
probabilities $\alpha$ the distribution of $X_{R(T)}$ on $\Omega_{R(T)}$ will be
denoted $P_{\alpha,T}$. The probability of a single outcome $x \in \Omega_{R(T)}$ is
$p(x; \alpha) = P_{\alpha,T}[X_{R(T)} = x]$.
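
The loss model above can be sketched as a short simulation of one probe. The example tree (source 0, branch point 2, receivers 4 and 5), the dictionary layout and the function names are hypothetical choices made for illustration only.

```python
import random

# Hypothetical tree: source 0 -> node 2 -> receivers 4 and 5.
CHILDREN = {0: [2], 2: [4, 5], 4: [], 5: []}


def simulate_probe(children, root, alpha, rng):
    """Return X_k for every node k: 1 if the probe reached k, else 0.

    `alpha[(i, j)]` is the transmission probability of link (i, j).
    """
    x = {root: 1}
    stack = [root]
    while stack:
        i = stack.pop()
        for j in children[i]:
            # A child can only receive the probe if its parent did.
            reached = x[i] == 1 and rng.random() < alpha[(i, j)]
            x[j] = 1 if reached else 0
            stack.append(j)
    return x


def receiver_outcome(children, x):
    """Restrict X to the leaves, which are the only observable nodes."""
    return {k: x[k] for k in children if not children[k]}
```

A call such as `receiver_outcome(CHILDREN, simulate_probe(CHILDREN, 0, alpha, random.Random(0)))` yields one configuration of zeros and ones at the receivers.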

3.  IDENTIFIABILITY 

In order to perform tomography from measurements on the tree set
$\Psi$, we require that the link probabilities are determined from the
set of leaf probabilities that are measured directly. We phrase this in
terms of identifiability, which captures the property that link
probabilities can be distinguished by measurements from an infinite
sequence of probes. We say that $\{P_{\alpha,T}\}_{T \in \Psi}$ identifies $\alpha$ if for any
$\alpha'$, $\{P_{\alpha,T}\}_{T \in \Psi} = \{P_{\alpha',T}\}_{T \in \Psi}$ implies $\alpha = \alpha'$. In this section,
we establish necessary and sufficient conditions for identifiability.

We are given a set $\Psi$ of canonical trees with an associated link
success probability vector $\alpha = (\alpha_{i,j})_{(i,j) \in E(\Psi)}$. Let $S$ be the set
of all segments within the trees contained in $\Psi$. Define $\beta_s$ to be
the logarithm of the probability that a packet successfully traverses
segment $s \in S$ given that it reached the start of that segment,
$\beta_s = \log(\prod_{(i,j) \in s} \alpha_{i,j}) = \sum_{(i,j) \in s} \log \alpha_{i,j}$. We introduce the
$\#S \times \#E(\Psi)$ matrix $A$ where $A_{s,(i,j)} = 1$ if link $(i, j)$ belongs
to segment $s$ and 0 otherwise. Using the set of trees in Figure 1
as an example, if we order the links as (0, 2), (2, 4), (2, 5), (1, 3),
(3, 5), (3, 2) and the segments as {(0, 2)}, {(2, 4)}, {(2, 5)}, {(1, 3)},
{(3, 5)}, {(3, 2), (2, 4)}, the matrix $A$ is

$$A = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}$$

If we define $z_{(i,j)} = \log \alpha_{i,j}$, $(i, j) \in E(\Psi)$, we then have the
following equation

$$A z = \beta \qquad (1)$$

Here the components of $z$ are $z_{(i,j)}$ and the components of $\beta$ are
$\beta_s$. Note that $A$ need not be a square matrix in general.
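
The construction of $A$ and the relation $Az = \beta$ can be sketched for the example above. The link and segment orderings are taken from the text; the function names and the choice of link probabilities in the usage note are illustrative assumptions.

```python
from math import log

LINKS = [(0, 2), (2, 4), (2, 5), (1, 3), (3, 5), (3, 2)]
SEGMENTS = [
    [(0, 2)], [(2, 4)], [(2, 5)],
    [(1, 3)], [(3, 5)], [(3, 2), (2, 4)],
]


def incidence_matrix(segments, links):
    """A[s][l] = 1 if link l belongs to segment s, else 0."""
    return [[1 if link in seg else 0 for link in links] for seg in segments]


def segment_log_probs(A, alpha, links):
    """beta = A z with z = log(alpha), one entry per segment."""
    z = [log(alpha[link]) for link in links]
    return [sum(a * zj for a, zj in zip(row, z)) for row in A]
```

For instance, with every link probability set to $e^{-1}$, each entry of `segment_log_probs` is minus the number of links in the corresponding segment.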

Before stating and proving results on identifiability, we note that for
a given set of link probabilities $\alpha$, there exists at least one solution,
namely $z = \log \alpha$, to (1). Let $A^T$ denote the matrix transpose of
$A$.

Theorem 1. Let $\Psi$ be a set of canonical loss trees. Then the
following are equivalent:

(i) For some $\alpha$, $\{P_{\alpha,T}\}_{T \in \Psi}$ identifies $\alpha$.

(ii) Equation (1) has a unique solution $z = (A^T A)^{-1} A^T \beta$.

(iii) $Az = 0$ iff $z = 0$.

(iv) For all $\alpha$, $\{P_{\alpha,T}\}_{T \in \Psi}$ identifies $\alpha$.


4.2  Minimum  Variance  Weighted  Average 

A technique for loss inference for a single tree has been proposed in
[4]. For a given set of trees $\Psi$ we can proceed as follows: (1)
consider each tree $T \in \Psi$ separately, by using the algorithm provided
in [4] on the measurements from $T$; this yields estimates for all
segments in $T$; (2) combine the estimates from the different trees.

We first consider the problem of combining estimators of segment
transmission probabilities. Let $s$ be a segment, and $\Psi_s \subseteq \Psi$ the
maximal set of topologies that include $s$ as a segment. Inference on
each logical topology $T \in \Psi_s$ provides us with an estimate $\hat q_{s,T}$ of
the transmission probability $q_s = e^{\beta_s}$ across the segment $s$. How
should the $\hat q_{s,T}$ be combined to form a single estimate of $q_s$?

We consider convex combinations of the form

$$\hat q_s = \sum_{T \in \Psi_s} \lambda_T \hat q_{s,T}, \qquad \lambda_T \in [0, 1]; \quad \sum_{T \in \Psi_s} \lambda_T = 1. \qquad (2)$$
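
A convex combination of this form can be sketched as follows. The weights used here are inverse-variance weights, the standard minimum-variance choice for independent unbiased estimators; that choice, the assumption that per-tree variances are available as inputs, and the function name are assumptions of this sketch rather than details taken from the passage above.

```python
def combine_estimates(estimates, variances):
    """Convex combination of per-tree estimates of one segment probability.

    Assumes independent, unbiased estimates with known positive variances
    and uses weights proportional to 1 / variance.
    Returns (combined_estimate, weights).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be positive")
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [w / total for w in inverse]          # each in [0, 1], sum to 1
    combined = sum(w * q for w, q in zip(weights, estimates))
    return combined, weights
```

With estimates 0.9 and 0.8 and variances 0.01 and 0.03, the weights are 0.75 and 0.25, so the more precise tree dominates the combined value.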


Proof. (i)$\Leftrightarrow$(ii). First, we note that $\beta$ is identifiable from $\{P_{\alpha,T}\}_{T \in \Psi}$
(Theorem 3 in [4]). Suppose that $\{P_{\alpha,T}\}_{T \in \Psi}$ cannot identify $\alpha$,
i.e., there are at least two sets of link probabilities, $\alpha$ and $\alpha'$, that
are consistent with $\{P_{\alpha,T}\}_{T \in \Psi}$. Based on the derivation of (1)
there cannot exist a unique solution to (1). Similarly, if $\alpha$ is
identifiable, it is obtained by solving (1). Suppose that (1) does not have
a unique solution. Then, from the derivation of (1) it follows that
there exist multiple values of $\alpha$ that can give rise to $\{P_{\alpha,T}\}_{T \in \Psi}$.
Suppose that there exists a unique solution to (1). It is easy to show
by contradiction that necessarily there is only one value of $\alpha$ that
can give rise to $\{P_{\alpha,T}\}_{T \in \Psi}$. For (ii)$\Leftrightarrow$(iii), observe that (1) has
a unique solution if and only if the nullspace of $A$ is $\{0\}$. In
this case $A^T A$ is invertible, and the expression for $z$ then follows
on pre-multiplying (1) by $A^T$. $(A^T A)^{-1} A^T$ is the generalized
inverse of $A$; see [16]. Furthermore, solutions of (1) must be unique
for all $\alpha$, and hence (ii)$\Leftrightarrow$(iv). □

It  should  be  clear  from  this  theorem  that  identifiability  is  a  topo¬ 
logical  property,  i.e.,  not  dependent  on  the  values  a.  We  can  use 
this  fact  to  select  (3  at  our  convenience.  Suppose  we  are  interested 
in  identifying  a  set  of  links  a  set  of  links  C  C  Choosing 

Oij  =  e“^,V(i,  j)  £  results  in  (3a  =  ffs.  Hence  we  have: 


Theorem  2.  Let be  a  set  of  canonical  loss  trees.  {P  a, t}t  £9 
identifies  is  aunique  value  of  {z(i,j)  ■  ihj)  £ 

C|  that  satisfies  equation  (l)for  (da  =  Vs  £  S. 


4.  LOSS  INFERENCE 

In  this  section,  we  describe  two  algorithms  for  loss  inference  in  a 
collection  of  multicast  trees.  In  the  first  algorithm  we  perform  in¬ 
ference  on  each  tree  separately,  and  then  we  take  the  weighted  aver¬ 
age  of  the  different  estimates  so  obtained.  In  the  second  algorithm 
we  perform  inference  on  the  entire  set  of  measurement  from  all  of 
the  trees  using  the  Expectation-Maximization  (EM)  algorithm. 


4.1  Measurement  Experiment 

A  measurement  experiment  for  a  collection  of  multicast  trees  T' 
consists  of  sending  ut  probes  from  p(T),  T  £  T*.  Eor  each  T  £ 
'I',  we  denote  by  yiR(T)  =  Xr^r^  = 

ix'k,T)keR(T))  the  set  measured  of  end-to-end  loss  down  T.  xr  = 
{xr^t-))t£9  will  denote  the  complete  set  of  measurements. 


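We propose to select the minimum variance combination as the single estimator. By assumption, the $\hat{q}_{s,T}$ are independent, and so

$$\mathrm{Var}(\hat{q}_s) = \sum_{T \in \Psi_s} \lambda_T^2 \, \mathrm{Var}(\hat{q}_{s,T}). \tag{3}$$

$\mathrm{Var}(\hat{q}_s)$ is clearly jointly convex in the $(\lambda_T)_{T \in \Psi_s}$, and by explicit differentiation under the constraint $\sum_{T \in \Psi_s} \lambda_T = 1$, the minimum for $\mathrm{Var}(\hat{q}_s)$ occurs when

$$\lambda_T = \frac{\mathrm{Var}(\hat{q}_{s,T})^{-1}}{\sum_{T' \in \Psi_s} \mathrm{Var}(\hat{q}_{s,T'})^{-1}}. \tag{4}$$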
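The differentiation step can be spelled out with a Lagrange multiplier. The following is a sketch, writing $v_T$ as shorthand (introduced here, not in the original) for $\mathrm{Var}(\hat{q}_{s,T})$ and $\mu$ for the multiplier:

$$
\begin{aligned}
L(\lambda, \mu) &= \sum_{T \in \Psi_s} \lambda_T^2 v_T - \mu \Big( \sum_{T \in \Psi_s} \lambda_T - 1 \Big), \\
\frac{\partial L}{\partial \lambda_T} &= 2 \lambda_T v_T - \mu = 0 \;\Longrightarrow\; \lambda_T = \frac{\mu}{2 v_T}, \\
\sum_{T \in \Psi_s} \lambda_T = 1 &\;\Longrightarrow\; \frac{\mu}{2} = \Big( \sum_{T' \in \Psi_s} v_{T'}^{-1} \Big)^{-1} \;\Longrightarrow\; \lambda_T = \frac{v_T^{-1}}{\sum_{T' \in \Psi_s} v_{T'}^{-1}}.
\end{aligned}
$$

Substituting these weights back into the variance of the combination gives $\big( \sum_{T \in \Psi_s} v_T^{-1} \big)^{-1}$, which is no larger than any individual $v_T$.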


Now, in general, $\mathrm{Var}(\hat{q}_{s,T})$ depends on the topology $T$. But it follows from Theorem 5 in [4] that the asymptotic variance $n_T \mathrm{Var}(\hat{q}_{s,T})$ converges to $\bar{q}_s + O(\|\bar{\alpha}\|^2)$ as $n_T \to \infty$, where the bar denotes the complementary (loss) probability, $\bar{q}_s = 1 - q_s$. Thus, for small loss probabilities, we can use the approximation $\mathrm{Var}(\hat{q}_{s,T}) \approx n_T^{-1} \bar{q}_s$.

In this approximation, the coefficients are $\lambda_T \approx n_T / \sum_{T' \in \Psi_s} n_{T'}$. We will use this approximation in (2) as our minimum variance weighted average (MVWA) algorithm, i.e.,

$$\hat{q}_s = \frac{\sum_{T \in \Psi_s} n_T \, \hat{q}_{s,T}}{\sum_{T \in \Psi_s} n_T}. \tag{5}$$

We note two special cases: (i) $s$ comprises a single link $(i,j)$, in which case the estimate is for the link rate $\alpha_{i,j}$; (ii) only one tree contains $s$, in which case the sums in (5) trivially have one term.

It remains to recover link probabilities from the $\hat{q}_s$. Following Theorem 1, identifiable link probabilities $\alpha_{i,j}$ are estimated by

$$\log \hat{\alpha}_{i,j} = \sum_{s} A^{-1}_{(i,j),s} \log \hat{q}_s. \tag{6}$$

A simple example is when two segments $s, s'$ are such that $s$ is obtained by appending the link $(i,j)$ to $s'$. Clearly $A^{-1}_{(i,j),s} = 1 = -A^{-1}_{(i,j),s'}$, with (6) reducing to taking quotients: $\hat{\alpha}_{i,j} = \hat{q}_s / \hat{q}_{s'}$.
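The weighted average (5) and the quotient special case of (6) can be sketched in a few lines. The function name, the input layout and the numbers below are illustrative assumptions, not data from the paper.

```python
def mvwa(estimates):
    """Combine per-tree estimates of one segment's transmission probability.

    `estimates` is a list of (n_T, q_hat_sT) pairs: the number of probes sent
    down tree T and the estimate obtained from T alone. Implements (5).
    """
    total_probes = sum(n for n, _ in estimates)
    return sum(n * q for n, q in estimates) / total_probes

# Hypothetical numbers: segment s' is covered by two trees; segment s, which
# is s' followed by link (i, j), is covered by one tree.
q_s_prime = mvwa([(400, 0.97), (100, 0.92)])
q_s = mvwa([(400, 0.95)])
alpha_ij = q_s / q_s_prime  # quotient form of (6) for this special case
```

Trees that sent more probes get proportionally more weight, which matches the approximation $\lambda_T \approx n_T / \sum_{T'} n_{T'}$ used above.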

4.3  EM  Algorithm 

Here we turn to a more direct approach to inference, namely, we use the Maximum Likelihood Estimator to estimate $\alpha$ from the set of measurements $x_R$, i.e., we estimate $\alpha$ by the value $\hat{\alpha}$ which maximizes the probability of observing $x_R$.

Let $n_T(x_{R(T)})$ denote the number of probes for which the outcome $x_{R(T)} \in \Omega_{R(T)}$ is obtained, $T \in \Psi$. The probability of the $n_T$ independent observations $x_{R(T)}$ is then

$$p(x_{R(T)}; \alpha) = \prod_{m=1}^{n_T} p(x^m_{R(T)}; \alpha) = \prod_{x_{R(T)} \in \Omega_{R(T)}} p(x_{R(T)}; \alpha)^{n_T(x_{R(T)})}$$

and the probability of the complete set of measurements $x_R$ at the receivers is

$$p(x_R; \alpha) = \prod_{T \in \Psi} p(x_{R(T)}; \alpha). \tag{7}$$

Our goal is to estimate $\alpha$ by the maximizer of (7), namely,

$$\hat{\alpha} = \arg\max_{\alpha} \, p(x_R; \alpha). \tag{8}$$

In [4], a direct expression for $\hat{\alpha}$ is obtained for the case of a single tree, i.e., when $\#\Psi = 1$. For the general case, unfortunately, we have been unable to obtain a direct expression for $\hat{\alpha}$. Instead, we follow the approach in [7, 8], and employ the EM algorithm to obtain an iterative approximation $\hat{\alpha}^{(\ell)}$, $\ell = 0, 1, \ldots$, to $\hat{\alpha}$. To understand the idea behind the EM algorithm, assume that we can observe the entire loss process at each node, i.e., assume knowledge of the values $x_T = (x^1_T, \ldots, x^{n_T}_T)$ (with each $x^m_T = (x^m_{k,T})_{k \in V(T)}$), $T \in \Psi$. In this case estimation of $\alpha$ becomes trivial: with complete data knowledge it is easy to see that the MLE of the success probability $\alpha_{i,j}$ along link $(i,j)$ is just the fraction of probes successfully transmitted along $(i,j)$, $(i,j) \in E(\Psi)$, i.e.,

$$\hat{\alpha}_{i,j} = \frac{\sum_{T \in \Psi : (i,j) \in E(T)} n_{j,T}}{\sum_{T \in \Psi : (i,j) \in E(T)} n_{i,T}}, \qquad (i,j) \in E(\Psi), \tag{9}$$

where $n_{k,T} = \sum_{m=1}^{n_T} x^m_{k,T}$ is the number of probes sent from $\rho(T)$ which arrived at node $k \in V(T)$, $T \in \Psi$.


The EM algorithm assumes complete knowledge of the loss process, so that the resulting likelihood has a simple form. Since the complete data, and thus the counts $n_{k,T}$ (except at the leaf nodes), are not known, the EM algorithm proceeds iteratively to augment the actual observations with the unobserved observations at the interior links. Below we briefly describe the algorithm and the intuition behind it. We spell out the details in Section 7.

• Step 1. Select an initial link loss rate $\hat{\alpha}^{(0)}$. The simulation study suggests that the values the algorithm converges to are independent of $\hat{\alpha}^{(0)}$.

• Step 2. Estimate the (unknown) counts $n_{k,T}$ by $\hat{n}_{k,T} = E_{\hat{\alpha}^{(\ell)}}[n_{k,T} \mid x_R]$. In other words, we estimate the counts by their conditional expectation given the observed data $x_R$ under the probability law induced by $\hat{\alpha}^{(\ell)}$.

• Step 3. Compute the new estimate $\hat{\alpha}^{(\ell+1)}$ via (9), using the estimated counts $\hat{n}_{k,T}$ computed in the previous step in place of the actual (unknown) counts $n_{k,T}$. In other words, we set

$$\hat{\alpha}^{(\ell+1)}_{i,j} = \frac{\sum_{T \in \Psi : (i,j) \in E(T)} \hat{n}_{j,T}}{\sum_{T \in \Psi : (i,j) \in E(T)} \hat{n}_{i,T}}, \qquad (i,j) \in E(\Psi). \tag{10}$$

• Step 4. Iterate steps 2 and 3 until some termination criterion is satisfied. Set $\hat{\alpha} = \hat{\alpha}^{(\ell)}$, where $\ell$ is the terminal number of iterations.

As shown in Section 7, the EM iterates converge to a local (but not necessarily global) maximizer of (7). However, our simulation results suggest that they always converge to the global maximizer $\hat{\alpha}$ and that the convergence does not depend on the initial values.

5. SIMULATION EVALUATION

We evaluate our loss inference algorithms using the ns [18] simulator. This work has two parts: model simulation and network simulation. In the model simulation, losses are determined by time-invariant Bernoulli processes. In the network simulation, losses are due to congestion as probes compete with other background traffic. The majority of the background traffic in the network simulation is produced by TCP flows. However, we do include some on-off flows where the on and off periods have either a Pareto or an exponential distribution. We chose such a mix because TCP is the dominant transport protocol on the Internet.

Figure 2: Model simulation topology: Nodes are of three types; bold ellipse: potential sender, ellipse: potential receivers, and box: internal nodes.

Tree  Source  Receivers
1     0       12 13 14 15 16 17 18 19
2     1       12 13 14 15 16 17 18 19
3     2       12 13 14 15
4     25      16 17 18 19

Table 1: Tree layout for model simulation


5.1  Comparing  loss  probability 

Our  approach  for  comparing  two  sets  of  loss  probabilities  was  first 
introduced  in  [5].  Assume  that  we  want  to  compare  two  loss  prob¬ 
abilities  p  and  q.  For  example  p  could  be  an  inferred  probability 
on a link, q the corresponding actual probability. For some error margin $\varepsilon > 0$ we define the error factor

$$F_\varepsilon(p, q) = \max\left\{ \frac{p(\varepsilon)}{q(\varepsilon)}, \frac{q(\varepsilon)}{p(\varepsilon)} \right\} \tag{11}$$

where $p(\varepsilon) = \max\{\varepsilon, p\}$ and $q(\varepsilon) = \max\{\varepsilon, q\}$. Thus, we treat p and q as being not less than $\varepsilon$, and having done this, the error factor is the maximum ratio, upwards or downwards, by which they differ. Unless otherwise stated, we used the default value $\varepsilon = 10^{-3}$ in this
paper. This choice of metric is motivated by the desire to estimate
the  relative  magnitude  of  loss  ratios  on  different  links  in  order  to 
distinguish  those  which  suffer  higher  loss. 
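As a small illustration, the error factor can be computed as below. This is a sketch: the function name is hypothetical, and the default margin `eps=1e-3` is an assumption made for this example.

```python
def error_factor(p, q, eps=1e-3):
    """Error factor (11): largest ratio between p and q after flooring at eps."""
    p_eps = max(eps, p)
    q_eps = max(eps, q)
    return max(p_eps / q_eps, q_eps / p_eps)

# An inferred loss of 4% against an actual loss of 2% is off by a factor of 2,
# and so is an inferred loss of 1% against the same actual loss.
over = error_factor(0.04, 0.02)
under = error_factor(0.01, 0.02)
# Two probabilities below the margin are treated as equal.
tiny = error_factor(0.0, 0.0005)
```

Flooring at `eps` keeps the ratio finite when one of the two rates is zero, at the cost of ignoring differences between rates smaller than `eps`.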

5.2  Model  simulation 




The topology for model simulation is presented in Figure 2. A total of four trees are embedded in the topology as described in Table 1. A time-invariant Bernoulli loss process is associated with each link. In the simulation, uniform loss rates are assigned to all links.

We use loss rates of 2% and 4% on each link and let each source send an equal number of probes down its tree. For each loss rate, we vary the total number of probes sent by all sources from 50 to 1600. Each setting is simulated ten times with different random seeds. For each simulation, we use both MVWA and EM to estimate loss rates and compare them with the actual simulation loss rates.

Figure 3 shows box-plots* of error factors between inferred loss and simulated loss over all links and all runs. In the figure, error factors are displayed as a function of the number of probes, with one graph for each loss rate. (Note that the total number of probes increases exponentially.) In each graph, we plot error factors for both the MVWA (abbreviated as WA) algorithm and the EM algorithm. We observe from the graphs that, when the number of probes is small, the estimates produced by the EM algorithm show greater accuracy and less variability than those produced by the MVWA algorithm under both loss rates we simulate. However, as the number of probes increases, the estimates yielded by both algorithms become more accurate, the difference between the two algorithms shrinks, and their variability reduces. The same set of simulations was repeated with different numbers of probes in each tree. The results are very close to the case where the numbers of probes are equal.

Note that every link in the topology described in Figure 2 is a segment in at least one of the trees. We also simulated a network with an embedded collection of trees in which some links are not a segment in any tree even though they are identifiable. The error factors we observed are very similar to those presented in Figure 3.

Since the EM algorithm is more accurate and less variable than the MVWA algorithm, we focus on evaluating the EM algorithm in the next subsection.


*In a box-plot, the box has lines at the lower quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to show the extent of the rest of the data. Outliers are data with values beyond the ends of the whiskers.

5.3 Network simulation

In this section, we simulate two topologies, a small network in Figure 4 and a multicast topology based on the Abilene network. In both topologies, background traffic is generated by infinite TCP and on-off UDP flows. All the routers in the network are configured to be droptail routers, since droptail routers are prevalent in the Internet.

Figure 4: Small network simulation topology: Nodes are of three types; bold ellipse: potential sender, ellipse: potential receivers, and box: internal nodes.

Tree  Source  Receivers
1     0       20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
2     1       20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
3     2       20 21 22 23 24 25 26 27
4     5       28 29 30 31 32 33 34 35

Table 2: Tree layout for small network simulation

Small  network.  The  tree  layout  of  the  small  network  is  described 
in  Table  2.  We  use  constant  bit  rate  probes  and  the  interval  between 
probes  is  100ms.  We  conducted  a  total  of  7  simulations  which  dif¬ 
fer  according  to  the  duration  of  the  measurement.  We  start  with  an 
initial  duration  of  2  seconds  and  double  it  each  time  until  reaching 
128  seconds.  Each  of  these  simulations  is  run  10  times  with  differ¬ 
ent  random  seeds.  For  each  simulation,  we  calculate  the  loss  rates 
using  the  EM  algorithm. 

The link losses in this set of simulations are due to all flows competing for bandwidth. Since different types of flows may exhibit different behavior, the probe flow does not necessarily suffer the same loss rate as the background flows do. Therefore, the error in using inferred loss to estimate the link loss may be due to one of two possibilities: either the probe traffic loss rate differs from the all-traffic loss rate, or the estimates yielded by the EM algorithm do not agree with the probe loss rate. In order to distinguish them, we compare the inferred results to both the probe loss rate and the all-traffic loss rate.

Figure 5 illustrates box plots of error factors for all links and all simulation runs. The error factors are plotted as a function of measurement time. On the left we show the error factor between inferred and simulated all-traffic loss; on the right, between inferred and simulated probe loss. We observe from both graphs in the figure that both the error factors and their variability decrease as the number of probes increases. The improvements are more significant for short measurements.

We present scatter plots for all-traffic loss vs. inferred loss on the left and probe traffic loss vs. inferred loss on the right in Figure 6, for a measurement duration of 128 seconds. We make two observations. First, the inferred loss rate almost always overestimates the link loss rate. Second, the inferred loss rate provides a very good estimate of the probe traffic loss rate. The difference between the inferred loss rates and the all-traffic loss rates arises because the probe traffic endures a higher loss rate than the rest of the traffic. We conjecture that this is because the majority of the background traffic comes from infinite TCP flows. TCP reduces its sending rate when losses are detected. Therefore, fewer TCP packets will suffer loss. However, the CBR source sends probes at a constant rate which is not affected by congestion. We expect the algorithm to be more accurate in the Internet, since the Internet contains many short-lived TCP flows and many of them complete transmission before they respond to losses.

Figure 3: Accuracy of MVWA (WA) algorithm vs. EM: Box-plot of error factors over all links and all runs for (a) loss = 2% (left) and (b) loss = 4% (right).

Figure 5: Accuracy of EM algorithm vs. probing time: Error factor over all links and all runs.

Figure 6: Small network scatter plot: (a) all loss vs. inferred loss; (b) probe loss vs. inferred loss.

Abilene network. Abilene [21] is an advanced backbone network that supports the work of Internet2 universities as they develop advanced Internet applications. One major goal of Abilene is to provide a separate network to enable the testing of advanced network capabilities prior to their introduction into the application development network. Multicast is one such service. Abilene supports native and sparse mode multicast. As of October 1, 2001, the multicast protocols PIM-sparse, MBGP and MSDP have been deployed in the backbone. The Abilene multicast logical topology is illustrated in [22]. It consists of 159 nodes and 165 edges. Each node in the graph represents a physical location and each link represents a physical interconnection between two routers from different locations. Because the more detailed topology within each physical location is not available to us, we treat each node as a router and focus on the logical topology in our experiments. There are three types of links in the Abilene backbone: OC3 (155M), OC12 (622M) and OC48 (2.5G). The types of the links that connect participants to the backbone are not labeled, and we assume they are T3 (45M). Since the ns simulator does not allow us to simulate enough flows to fill up such high bandwidth links and generate losses, we scale down the bandwidth proportionally by a factor of $10^3$. Last, we assume that only the leaves in the topology (i.e., nodes of degree one) are senders or receivers.

We  lay  out  a  total  of  eight  trees  that  can  identify  41  links.  An  equal 
number  of  probes  is  sent  by  each  source  and  the  interval  between 
probes  is  200ms.  We  conducted  a  simulation  of  duration  256  sec¬ 
onds  and  ran  it  ten  times  with  different  random  seeds.  For  each 
simulation,  we  estimate  the  loss  rates  using  the  EM  algorithm  and 
compare  them  to  the  simulated  loss  rates.  Figure  7  illustrates  scat¬ 
ter  plots  for  inferred  loss  vs.  all  loss  (left)  and  inferred  loss  vs. 
probe loss (right). Similar to what we observed in the small network
simulation, the EM algorithm provides accurate estimates of probe
loss  rates.  However,  the  inferred  loss  rates  are  almost  always  higher 
than  the  simulated  all  traffic  loss  rates. 


6.  EXTENSIONS 

In this section, we first extend the EM algorithm to infer the distribution of link delays. Second, since multicast is not supported everywhere in the Internet and the internal performance observed by multicast packets may differ from that observed by unicast packets, it is important to show that our algorithms for inference over a set of multicast trees can be applied to unicast measurements. Last, the algorithms we have presented so far rely on the availability of complete information from the receivers. However, this may pose a serious problem in their deployment. We demonstrate the use of our algorithms to handle incomplete observations from receivers.

6.1  Delay  inference 

We  now  illustrate  the  use  of  end-to-end  measurements  from  a  col¬ 
lection  of  multicast  trees  to  estimate  the  delay  characteristics  of 
internal  links. 

We associate with each link $(i,j)$ a random variable $D_{i,j}$ which represents the queueing delay that would be encountered by packets traversing link $(i,j)$. For the analysis, we quantize the queueing delay to a finite set of values $Q = \{0, q, 2q, \ldots, Bq, \infty\}$, where $q$ is a suitable fixed bin size. A queueing delay equal to $\infty$ indicates that the packet is lost on the link. We define the bin associated to $iq \in Q$ to be the interval $[iq - q/2, iq + q/2)$, $i = 1, \ldots, B$, and $[Bq + q/2, \infty)$ the one associated to the value $\infty$. Because delay is non-negative, we associate with 0 the bin $[0, q/2)$. We thus model the link queueing delay by a nonparametric discrete distribution that we can regard as a discretized version of the actual delay distribution. We denote the distribution of $D_{i,j}$ by $\alpha_{i,j} = (\alpha_{i,j}(d))_{d \in Q}$, where $\alpha_{i,j}(d) = P[D_{i,j} = d]$, $d \in Q$. We will denote $\alpha = (\alpha_{i,j})_{(i,j) \in E(\Psi)}$. We assume that queueing delays are independent between different packets, and for the same packet on different links. Thus the progress of each probe down the tree $T$ is described by an independent copy of a stochastic process $Y_T = (Y_{k,T})_{k \in V(T)}$ which represents the accrued queueing delay of packets. The queueing delay experienced by a packet from $\rho(T)$ to node $i$ is $Y_{i,T} = \sum_{(m,n) \in p_T(\rho(T), i)} D_{m,n}$, where $p_T(\rho(T), i)$ denotes the path on tree $T$ from the source to node $i$.
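The binning rule can be sketched as follows. The function name and the use of `None` to mark a lost packet are assumptions made for this example.

```python
import math

def quantize_delay(delay, q, B):
    """Map a queueing delay to the discrete set {0, q, ..., B*q, inf}.

    `delay` is None for a lost packet. The value i*q stands for the bin
    [i*q - q/2, i*q + q/2); the value 0 stands for [0, q/2); any delay at or
    above B*q + q/2 is mapped to infinity.
    """
    if delay is None:
        return math.inf
    i = math.floor(delay / q + 0.5)  # index of the nearest bin centre
    return i * q if i <= B else math.inf
```

For instance, with `q = 1.0` and `B = 3`, a delay of 0.4 falls in the bin for 0, a delay of 0.5 in the bin for `q`, and a delay of 3.5 is treated the same way as a loss.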

In an experiment, a set of probes is sent from the multicast tree sources $\rho(T)$, $T \in \Psi$. For each $T \in \Psi$, we can think of each probe as a trial, the outcome of which is a configuration of source-to-receiver queueing delays $Y_{R(T)} = (Y_{k,T})_{k \in R(T)}$, which we also discretize to the set $Q$. Each outcome is thus an element of the space $\Omega_{R(T)}$.

As with loss estimation, we use maximum likelihood estimation based on measurements across the multicast trees $T \in \Psi$. Let us dispatch $n_T$ probes from $\rho(T)$, $T \in \Psi$, and let $n_T(y_{R(T)})$ denote the number of probes for which the outcome $y_{R(T)} \in \Omega_{R(T)}$ is obtained. The probability of the $n_T$ independent observations $y_{R(T)} = (y^1_{R(T)}, \ldots, y^{n_T}_{R(T)})$ (with each $y^m_{R(T)} = (y^m_{k,T})_{k \in R(T)}$) is then

$$p(y_{R(T)}; \alpha) = \prod_{m=1}^{n_T} p(y^m_{R(T)}; \alpha) = \prod_{y_{R(T)} \in \Omega_{R(T)}} p(y_{R(T)}; \alpha)^{n_T(y_{R(T)})}$$

where $p(y_{R(T)}; \alpha) = P_\alpha[Y_{R(T)} = y_{R(T)}]$. The probability of the complete set of measurements $y_R = (y_{R(T)})_{T \in \Psi}$ at the receivers is

$$p(y_R; \alpha) = \prod_{T \in \Psi} p(y_{R(T)}; \alpha). \tag{12}$$

Our goal is to estimate $\alpha$ by the maximizer of (12), namely,

$$\hat{\alpha} = \arg\max_{\alpha} \, p(y_R; \alpha). \tag{13}$$

As with loss inference, we resort to the EM algorithm to obtain an iterative solution $\hat{\alpha}^{(\ell)}$, $\ell = 0, 1, \ldots$, to a (local) maximizer of the likelihood (12). Assume complete knowledge of the delay process at each link, namely the values $y_T = (y^1_T, \ldots, y^{n_T}_T)$ (with each $y^m_T = (y^m_{k,T})_{k \in V(T)}$), $T \in \Psi$. Denote by $n_{i,j,T}(d)$ the total number of packets sent by $\rho(T)$ that experienced a delay equal to $d$ along link $(i,j)$. It is easy to verify that with complete data, the MLE of $\alpha_{i,j}(d)$ is

$$\hat{\alpha}_{i,j}(d) = \frac{\sum_{T \in \Psi : (i,j) \in E(T)} n_{i,j,T}(d)}{\sum_{d' \in Q} \sum_{T \in \Psi : (i,j) \in E(T)} n_{i,j,T}(d')}, \qquad \forall (i,j) \in E(\Psi). \tag{14}$$

Thus, with complete knowledge, the MLE of $\alpha_{i,j}(d)$ is simply the fraction of the probes traversing link $(i,j)$ which encountered a delay equal to $d$.


For  delay  inference  the  EM  algorithm  proceeds  as  for  the  loss  case. 
Below  we  briefly  describe  the  algorithm  and  intuition  behind  it. 
Details  can  be  found  in  [3]. 


1. Step 1. Select the initial link delay distribution $\hat{\alpha}^{(0)}$.

2. Step 2. Given the current estimate $\hat{\alpha}^{(\ell)}$, estimate the (unknown) counts $n_{i,j,T}(d)$ by $\hat{n}_{i,j,T}(d) = E_{\hat{\alpha}^{(\ell)}}[n_{i,j,T}(d) \mid y_R]$. In other words, we estimate the counts by their conditional expectation given the observed data $y_R$ under the probability law induced by $\hat{\alpha}^{(\ell)}$.

3. Step 3. Compute the new estimate $\hat{\alpha}^{(\ell+1)}$ via (14), using the estimated counts $\hat{n}_{i,j,T}(d)$ computed in the previous step in place of the actual (unknown) counts $n_{i,j,T}(d)$.

4. Iteration. Iterate steps 2 and 3 until some termination criterion is satisfied. Set $\hat{\alpha} = \hat{\alpha}^{(\ell)}$, where $\ell$ is the terminal number of iterations.

Figure 7: Abilene scatter plot: (a) all loss vs. inferred loss; (b) probe loss vs. inferred loss.

Complexity 

The complexity of the algorithm is dominated by the computation of the conditional expectations, which can be accomplished in time linear in $\#V(T) \times$ […], $T \in \Psi$. The computation can be done by extending the approach for computing the loss conditional probability and is described in [3].

Convergence 

The  conditions  for  convergence  can  be  established  similarly  as  for 
loss  inference. 

6.2  Inference  with  unicast  measurement 

So  far  we  have  presented  inference  algorithms  for  a  collection  of 
trees  based  on  end-to-end  multicast  measurements.  These  tech¬ 
niques  can  be  extended  to  work  with  unicast  measurements  from 
multiple  sources  as  well. 

The rationale behind unicast-based inference is that (1) the measurement domain is limited because large portions of the Internet do not support network-level multicast, and (2) the internal performance observed by multicast packets may differ from that observed by unicast packets. Techniques for unicast measurements and inference have recently been proposed in [6, 11] for the inference of loss rates and in [7, 8, 9] for delay distributions. However, these works only handle inference from a single source with multiple pairs of receivers, which may severely limit their scope.

The key idea behind unicast inference is to design unicast measurements whose correlation properties closely resemble those of multicast traffic, so that it becomes possible to use the inference techniques developed for multicast inference; the closer the correlation properties are to those of multicast traffic, the more accurate the results.


A basic approach for unicast inference is to dispatch two back-to-back packets (a packet pair) from a probe source to a pair of distinct receivers. For each such packet pair, the two packets traverse a common set of links down to a node where their paths diverge to the two receivers. By choosing multiple sources and pairs of receivers, it is possible to cover a more significant portion of a network than with a single source. The inference of the link loss probability and link delay distribution from a set of packet pair measurements is formulated as a maximum likelihood estimation problem, which is then solved using the algorithms we presented earlier in the paper. The idea is to treat each unicast packet pair measurement as statistically equivalent to a notional multicast packet that descends the same tree. The entire set of measurements is thus considered equivalent to a set of multicast measurements down a collection of two-leaf trees. The analysis then follows the same approach for a collection of trees detailed in Sections 4 and 6.1.
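The bookkeeping behind this reduction can be sketched as follows: every distinct (source, receiver pair) becomes one two-leaf tree, and every packet pair becomes one notional multicast probe on that tree. The record layout and function name are assumptions made for this example.

```python
from collections import Counter, defaultdict

def pairs_to_two_leaf_trees(pair_records):
    """Group packet-pair outcomes into per-tree outcome counts.

    Each record is (source, (recv_a, recv_b), (x_a, x_b)), where x_a and x_b
    are 1 if the packet sent to that receiver arrived and 0 otherwise.
    Returns {(source, recv_1, recv_2): Counter({(x_1, x_2): count})}.
    """
    trees = defaultdict(Counter)
    for source, (ra, rb), (xa, xb) in pair_records:
        # Use one canonical receiver order so (a, b) and (b, a) share a tree.
        if ra > rb:
            ra, rb, xa, xb = rb, ra, xb, xa
        trees[(source, ra, rb)][(xa, xb)] += 1
    return trees

# Hypothetical measurements from two sources.
records = [
    (0, (5, 6), (1, 1)),
    (0, (6, 5), (0, 1)),
    (0, (5, 6), (1, 1)),
    (1, (5, 7), (0, 0)),
]
two_leaf_trees = pairs_to_two_leaf_trees(records)
```

The resulting outcome counts play the role of $n_T(x_{R(T)})$ for each two-leaf tree. How closely the packet pair mimics a multicast probe on the shared path is a property of the measurement, not of this bookkeeping.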

6.3  Inference  with  missing  data 

The algorithms presented in the paper so far rely on the availability of complete information from the receivers. However, as described in [10], this may pose a serious problem in their deployment. For example, the loss reports from receivers may be delivered unreliably and there may be bandwidth constraints for transmitting loss reports. Therefore, it is important to extend the algorithms to handle incomplete data sets. An algorithm has been proposed in [10] to handle incomplete data for a single tree. The goal of this section is to extend the algorithms we proposed earlier in the paper to handle incomplete data for a collection of trees.

The basic idea is first to convert each tree $T \in \Psi$ with incomplete observations to multiple sub-trees sharing the same source but with complete observations. For a tree $T$ with incomplete data in a collection of trees $\Psi$, assume that the outcomes of the $k$th probe sent by $\rho(T)$ are only observable by $R_k(T) \subseteq R(T)$. With probe $k$, we associate the multicast tree $T_k$ that spans the root $\rho(T)$ and $R_k(T)$. This is obtained by finding the spanning tree of $\rho(T)$ and $R_k(T)$ in $T$. Therefore, the tree $T$ with incomplete observations can be treated as a set of trees $\{T_k\}_{k=1,\ldots,n_T}$, each of which has complete observations. Note that the same tree may appear many times in $\{T_k\}_{k=1,\ldots,n_T}$; such copies can be merged into one tree with multiple probes. For each tree with incomplete data in $\Psi$, we replace it with the set of its subtrees with complete data and add these trees to $\Psi$. We then have a set of trees each of which has complete data, and the algorithms described in Sections 4 and 6.1 can be applied to the inference of loss rate and delay distribution.
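A sketch of this conversion for a single tree is given below. The representation (a child-to-parent map, and one dictionary of collected reports per probe) and the function name are assumptions made for this example; the subtree is returned as a set of links of $T$, without collapsing chains of links into segments.

```python
from collections import Counter, defaultdict

def split_by_reporting_set(parent, probes):
    """Turn one tree with missing reports into subtrees with complete data.

    `parent` maps every non-root node of T to its parent. `probes` is a list
    of dicts {receiver: 0 or 1} holding only the reports actually collected
    for that probe. Probes with the same reporting set R_k(T) are merged.
    Returns {receivers: (edges_of_subtree, Counter_of_outcomes)}.
    """
    outcomes_by_set = defaultdict(Counter)
    for report in probes:
        if not report:
            continue  # a probe with no reports carries no information
        receivers = tuple(sorted(report))
        outcomes_by_set[receivers][tuple(report[r] for r in receivers)] += 1

    subtrees = {}
    for receivers, outcomes in outcomes_by_set.items():
        # The subtree T_k is the union of the root-to-receiver paths in T.
        edges = set()
        for node in receivers:
            while node in parent:
                edges.add((parent[node], node))
                node = parent[node]
        subtrees[receivers] = (edges, outcomes)
    return subtrees
```

Each entry of the result can then be handed to the inference algorithm as a separate tree with complete observations.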


7.  EM  ALGORITHM  FOR  LOSS  INFERENCE 

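We find it convenient to work with the log-likelihood function

    $\mathcal{L}^{inc}(x_R; \alpha) = \sum_{T \in \mathcal{T}} \mathcal{L}^{inc}_T(x_{R(T)}; \alpha)$    (15)

where $\mathcal{L}^{inc}_T(x_{R(T)}; \alpha) = \log p(x_{R(T)}; \alpha)$
is the log-likelihood of the measurements down the tree $T \in \mathcal{T}$.

We estimate $\alpha$ by the maximizer of the likelihood (15), namely
$\hat\alpha = \arg\max_{\alpha} \mathcal{L}^{inc}(x_R; \alpha)$. We follow the approach in [7, 8]
and employ the EM algorithm to obtain an iterative approximation
to the maximizer of (15). The basic idea is that rather than performing
a complicated maximization, we "augment" the observed data
with unobserved or latent data so that the resulting log-likelihood
has a simpler form. Following [8], we augment the actual observations
with the unobserved outcomes at the interior links. In
other words, we assume complete knowledge of the loss process.
The log-likelihood for the complete data $x = (x_T)_{T \in \mathcal{T}}$ is

    $\mathcal{L}(x; \alpha) = \sum_{T \in \mathcal{T}} \mathcal{L}(x_T; \alpha)$    (16)

where $\mathcal{L}(x_T; \alpha) = \log p(x_T; \alpha)$ is the log-likelihood of the complete
data set for $T$. It is easy to see that
$p(x_T; \alpha) = \prod_{(i,j) \in E(T)} \alpha_{i,j}^{n_{j,T}} \bar\alpha_{i,j}^{\,n_{i,T} - n_{j,T}}$,
with $\bar\alpha_{i,j} = 1 - \alpha_{i,j}$, and that

    $\mathcal{L}(x; \alpha) = \sum_{(i,j) \in E(\mathcal{T})} \Big( \sum_{T \in \mathcal{T}_{i,j}} n_{j,T} \log \alpha_{i,j}$    (17)
        $+ \big( \sum_{T \in \mathcal{T}_{i,j}} n_{i,T} - \sum_{T \in \mathcal{T}_{i,j}} n_{j,T} \big) \log \bar\alpha_{i,j} \Big).$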


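Maximization of (17) is trivial, as the stationary point conditions

    $\frac{\partial \mathcal{L}(x; \alpha)}{\partial \alpha_{i,j}} = 0, \quad (i,j) \in E(\mathcal{T})$    (18)

immediately yield

    $\hat\alpha_{i,j} = \frac{\sum_{T \in \mathcal{T}_{i,j}} n_{j,T}}{\sum_{T \in \mathcal{T}_{i,j}} n_{i,T}}, \quad (i,j) \in E(\mathcal{T}).$    (19)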
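
As an illustration of (19), the following minimal Python sketch pools complete-data counts over several trees. It is a sketch under stated assumptions, not the authors' implementation: the function name `pooled_link_mle` and the input layout (one `(edges, counts)` pair per tree, with node labels shared across trees so that a common link has the same `(i, j)` key everywhere) are hypothetical choices made for this example.

```python
from collections import defaultdict


def pooled_link_mle(tree_counts):
    """Complete-data estimate (19), pooled over all trees that contain each link.

    tree_counts: list of (edges, n) pairs, one per tree (hypothetical layout).
      edges: list of links (i, j) with i the parent of j.
      n:     dict mapping node -> number of probes that reached it in that tree.
    """
    num = defaultdict(float)  # sum over trees of n_{j,T}
    den = defaultdict(float)  # sum over trees of n_{i,T}
    for edges, n in tree_counts:
        for (i, j) in edges:
            num[(i, j)] += n[j]
            den[(i, j)] += n[i]
    # Links whose parent was never reached carry no information and are skipped.
    return {link: num[link] / den[link] for link in num if den[link] > 0}


if __name__ == "__main__":
    tree_a = ([(0, 1), (1, 2)], {0: 100, 1: 80, 2: 60})
    tree_b = ([(0, 1), (1, 3)], {0: 50, 1: 40, 3: 10})
    print(pooled_link_mle([tree_a, tree_b]))
```

The shared link (0, 1) is estimated from the probes of both trees, which is the point of summing over the trees containing the link before taking the ratio.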


Since $x$, and thus the counts $n_{k,T}$ except at the leaves, are not known, the EM
algorithm uses the complete log-likelihood $\mathcal{L}(x; \alpha)$ to iteratively
find $\hat\alpha$ as follows:

1. Initialization. Select the initial link loss rate $\hat\alpha^{(0)}$. The simulation
study suggests that the values to which the algorithm converges
are independent of the initial values.

2. Expectation. Given the current estimate $\hat\alpha^{(\ell)}$, compute the
conditional expectation of the log-likelihood given the observed
data $x_R$ under the probability law induced by $\hat\alpha^{(\ell)}$, namely
$Q(\alpha'; \hat\alpha^{(\ell)})$ of equation (20),
where $\hat n_{k,T} = E_{\hat\alpha^{(\ell)}}[n_{k,T} \mid x_R]$. $Q(\alpha'; \hat\alpha^{(\ell)})$ has the same
expression as $\mathcal{L}(x; \alpha')$ but with the actual unobserved counts
$n_{k,T}$ replaced by their conditional expectations $\hat n_{k,T}$. To
compute $\hat n_{k,T}$, recall that $n_{k,T} = \sum_{m=1}^{n_T} X^{(m)}_{k,T}$. Thus,
we have

    $\hat n_{k,T} = \sum_{m=1}^{n_T} P_{\hat\alpha^{(\ell)}}[X_{k,T} = 1 \mid X_{R(T)} = x^{(m)}_{R(T)}]$    (21)
    $\phantom{\hat n_{k,T}} = \sum_{x_{R(T)} \in \Omega_{R(T)}} n_T(x_{R(T)}) \, P_{\hat\alpha^{(\ell)}}[X_{k,T} = 1 \mid X_{R(T)} = x_{R(T)}]$

3. Maximization. Find the maximizer of the conditional expectation,
$\hat\alpha^{(\ell+1)} = \arg\max_{\alpha'} Q(\alpha'; \hat\alpha^{(\ell)})$. The maximizer is
given by (19) with the conditional expectation $\hat n_{k,T}$ in place
of $n_{k,T}$.

4. Iteration. Iterate steps 2 and 3 until some termination criterion
is satisfied. Set $\hat\alpha = \hat\alpha^{(\ell)}$ where $\ell$ is the terminal
number of iterations.
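
The four steps above can be sketched in Python for the special case of a single tree with complete leaf observations. This is a minimal sketch under stated assumptions, not the paper's algorithm as implemented: the E-step enumerates every configuration of the non-root nodes by brute force (exponential in the tree size, so suitable for tiny trees only, whereas the text cites a linear-time method), each link is indexed by its child node, and the names `joint_prob`, `expected_counts` and `em_loss` are hypothetical.

```python
from itertools import product


def joint_prob(state, parent, alpha):
    """P[X = state] under the Bernoulli loss model; alpha[k] is the
    transmission probability of the link ending at node k."""
    p = 1.0
    for k, f in parent.items():
        if state[f] == 1:
            p *= alpha[k] if state[k] == 1 else 1.0 - alpha[k]
        elif state[k] == 1:
            return 0.0  # a probe cannot reach k without reaching its parent
    return p


def expected_counts(parent, alpha, leaf_obs, root=0):
    """E-step: conditional expectation of n_k given the leaf observations."""
    nodes = list(parent)
    nhat = {k: 0.0 for k in nodes}
    nhat[root] = float(len(leaf_obs))  # every probe is sent from the root
    for obs in leaf_obs:
        weights = {k: 0.0 for k in nodes}
        total = 0.0
        for bits in product((0, 1), repeat=len(nodes)):
            state = dict(zip(nodes, bits))
            state[root] = 1
            if any(state[k] != v for k, v in obs.items()):
                continue  # inconsistent with what the receivers reported
            p = joint_prob(state, parent, alpha)
            total += p
            for k in nodes:
                weights[k] += p * state[k]
        # total is positive as long as alpha stays strictly inside (0, 1)
        for k in nodes:
            nhat[k] += weights[k] / total
    return nhat


def em_loss(parent, leaf_obs, iters=200, init=0.5, root=0):
    """Iterate E-step and M-step; the M-step is (19) with expected counts."""
    alpha = {k: init for k in parent}
    for _ in range(iters):
        nhat = expected_counts(parent, alpha, leaf_obs, root)
        alpha = {k: nhat[k] / nhat[parent[k]] for k in parent}
    return alpha


if __name__ == "__main__":
    parent = {1: 0, 2: 1, 3: 1}  # root 0 -> node 1 -> receivers 2 and 3
    obs = ([{2: 1, 3: 1}] * 40 + [{2: 1, 3: 0}] * 20
           + [{2: 0, 3: 1}] * 10 + [{2: 0, 3: 0}] * 30)
    print(em_loss(parent, obs))
```

A fixed iteration count stands in for the unspecified termination criterion of step 4; a tolerance on the change in the estimates would serve equally well.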


Complexity 

The complexity of the algorithm is dominated by computation of
the conditional expectations $\hat n_{k,T}$. This can be accomplished in time
linear in the size of each tree $T \in \mathcal{T}$. The algorithm is described in [3].

Convergence 

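We establish conditions for convergence of estimated parameters
and likelihood under the EM algorithm for loss inference. Observe
that the complete data log-likelihood function (17) can be written,
up to a term involving only the known root counts $n_{\rho(T),T}$, as

    $\mathcal{L}(x; \alpha) = \sum_{T \in \mathcal{T}} \sum_{i \in V(T) \setminus \{\rho(T)\}} n_{i,T} \, \phi_{i,T}(\alpha)$    (22)

where

    $\phi_{i,T}(\alpha) = \log \Big( \frac{\alpha_{f(i,T),i}}{\bar\alpha_{f(i,T),i}} \prod_{j \in d(i,T)} \bar\alpha_{i,j} \Big)$    (23)

with $\bar\alpha_{i,j} = 1 - \alpha_{i,j}$. (Here the empty product when $d(i,T) = \emptyset$ is taken as 1.) Thus
the log-likelihood comes from an exponential family with sufficient
statistics $(n_{i,T})_{T \in \mathcal{T},\, i \in V(T)}$ and parameters $\alpha$. The exponential
family is regular, since we take $\alpha$ in the convex set
$A = (0,1)^{E(\mathcal{T})}$, and the map from $\alpha$ to $\phi$ is invertible:
$\exp(\phi_{i,T}) = \alpha_{f(i,T),i} / \bar\alpha_{f(i,T),i}$ for a receiver $i$ in $R(T)$. Invertibility then
follows by induction: if we know all the $(\alpha_{i,j})_{j \in d(i,T)}$ then we can
recover $\alpha_{f(i,T),i}$ from $\phi_{i,T}$. It follows that the exponential family is
curved: the $\phi_{i,T}$ are constrained to a smooth submanifold of the space of
all $(\phi_{i,T})$, of dimension equal to the number of links, through the constraint that the
link probabilities $\alpha$ calculated from $\phi_T$ on different trees $T$ must
agree on common links.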


The following convergence results for the sequence of EM iterates
$\hat\alpha^{(\ell)}$ follow from the regular exponential family property; see
Theorem 6 in [20].


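The conditional expectation used in the Expectation step of the EM algorithm is

    $Q(\alpha'; \hat\alpha^{(\ell)}) = E_{\hat\alpha^{(\ell)}}[\mathcal{L}(x; \alpha') \mid x_R]$    (20)
    $\phantom{Q} = \sum_{(i,j) \in E(\mathcal{T})} \Big( \sum_{T \in \mathcal{T}_{i,j}} \hat n_{j,T} \log \alpha'_{i,j} + \big( \sum_{T \in \mathcal{T}_{i,j}} \hat n_{i,T} - \sum_{T \in \mathcal{T}_{i,j}} \hat n_{j,T} \big) \log (1 - \alpha'_{i,j}) \Big).$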


Theorem 3. (i) $\mathcal{L}^{inc}(x_R; \hat\alpha^{(\ell)})$ converges to some limit $L$.

(ii) If $\{\alpha \in A \mid \mathcal{L}^{inc}(x_R; \alpha) = L\}$ is discrete, $\hat\alpha^{(\ell)}$ converges
to some $\alpha^*$ that is a stationary point of $\mathcal{L}^{inc}(x_R; \alpha)$.

(iii) If $\mathcal{L}^{inc}(x_R; \alpha)$ is unimodal, $\hat\alpha^{(\ell)}$ converges to the incomplete
data MLE, namely $\hat\alpha = \arg\max_{\alpha} \mathcal{L}^{inc}(x_R; \alpha)$.

The theorem implies that when there are multiple stationary points,
e.g. local maxima, the EM iterates may not converge to the global
maximizer. Unfortunately, we were not able to establish whether
there is a unique stationary point, or conditions under which uniqueness
holds.

8.  SUMMARY 

In this paper, we focused on inferring network internal link-level
performance from end-to-end multicast measurements taken from
a collection of trees. We addressed two questions:

•  Given a collection of multicast trees, are all of the links
(or a specified subset) identifiable?

•  If a set of links of interest is identifiable, how do we obtain
accurate estimates of their performance?

With loss rates as performance metrics, we established necessary
and sufficient conditions for identifiability, and proposed two
algorithms, the MVWA algorithm and the EM algorithm, for inferring the
loss rates of a set of links of interest. The algorithms were evaluated through model
simulation and network simulation. The model simulation suggests
that the EM algorithm is more accurate and has lower variability. In
the network simulation, we observed that the EM algorithm provides
accurate estimates of the probe traffic loss, whereas it slightly over-estimates
the loss of all traffic. Moreover, we extended the EM algorithm to
infer link delays, and demonstrated how to use our algorithms when
only unicast measurements are available or some of the observations
made at end-hosts are missing.

Abstract

Network tomography using multicast probes enables inference of loss characteristics of internal
network links from reports of end-to-end loss seen at multicast receivers. In this paper we develop
estimators for internal loss rates when reports are not available on some probes or from some receivers. This
problem is motivated by the use of unreliable transport control protocols, such as RTCP, to transmit loss
reports to a collector for inference. We use a maximum likelihood (ML) approach in which we apply
the Expectation Maximization (EM) algorithm to provide an approximate value for the ML estimator for
the incomplete data problem. We present a concrete implementation of the algorithm that can be applied
to measured data. For certain classes of models we establish identifiability of the probe and report loss
parameters, and convergence of the EM sequence to the MLE. Numerical results suggest that these
properties hold more generally. We derive convergence rates for the EM iterates, and the estimation error of
the MLE. Last, we evaluate the accuracy and convergence rate through extensive simulations.

Keywords: End-to-end Measurement, Network Tomography, Missing Data, Maximum Likelihood
Estimation, EM Algorithm, Multicast, RTP, RTCP.


1  Introduction 

1.1  Motivation 

As the Internet grows in size and diversity, its internal performance becomes ever more difficult to measure.
Any one organization has administrative access to only a small fraction of the network's internal nodes,
whereas commercial factors often prevent organizations from sharing internal performance data.

One promising technique that avoids these problems, Multicast Inference of Network Characteristics
(MINC), uses end-to-end multicast measurements to infer link-level loss rates and delay statistics by
exploiting the inherent (and well characterized) correlation in performance observed by multicast receivers.
These measurements do not rely on administrative access to internal nodes since the inference can be
calculated using only information recorded at the end hosts.

The key intuition for inferring packet loss is that the arrival of a packet at a given internal node can
be directly inferred from the packet's arrival at one or more receivers reached from the source by paths
through that node; if it reaches the receivers, it must have reached the node. Conditioning on arrival at
a descendant, we can determine the probability of successful transmission to and beyond the given node.
Efficient inference algorithms are given in [2] for loss, [15] for delay distributions, [9] for delay variances,
and [3] for inferring the logical multicast tree topology itself. Extensions of these ideas to unicast (where
multicast is replaced by a packet pair [5] or a packet stripe [10]) have also been proposed.

* This work was supported in part by DARPA and the AFL under agreement F30602-98-2-0238

All of the algorithms based on the MINC methodology rely on the availability of complete information
from the receivers. This poses a serious problem in their deployment. For example, one promising avenue of
deployment is through the extension of RTCP, the RTP [20] control protocol, to provide extended loss reports
[11]. By piggybacking MINC loss reports on a standard transport protocol, one can effectively co-opt regular
applications and their traffic to form a lightweight impromptu measurement infrastructure that encompasses
many host end-points. However, loss reports are typically transmitted unreliably. Furthermore, the RTP
standard imposes a constraint on the bandwidth that can be used by RTCP packets. Thus, this deployment
will result in the availability of only incomplete data sets for the purpose of network inference. The need
to analyze incomplete data sets also arises in the extension of the MINC techniques to unicast as they rely
on collecting data from subsets of receivers. Thus there is a need to modify the inference techniques to
be able to handle incomplete data sets. Using loss as an example, the goal of this paper is to extend the
techniques developed in [2] to handle incomplete data.

1.2  Contribution 

In this paper we adapt the multicast inference techniques proposed in [2] to perform inference of internal
network characteristics when data is missing from some of the receivers. The data for the inference
comprises measured end-to-end loss of multicast probes sent from a source to a number of destinations, but
where only a subset of the destinations report their observations for each multicast probe. This is used to
infer the loss characteristics of each logical link of the tree joining the source to the destinations, i.e., of the
composite paths between its branch points.

A simple approach to manage the impact of missing data would be to restrict inference to subsets of
probes and receivers for which complete data is available, then patch together such estimators to draw
inference on the complete tree. There are three drawbacks with this approach: (i) unless the coverage is
sufficiently rich, it is not possible to infer transmission probabilities for all links; (ii) unless the missing data
distribution obeys certain conditions, known as Missing Completely at Random (MCAR), such estimators
are not consistent in that they remain biased even in the limit of infinitely many probes; and (iii) even under
MCAR such estimators are not generally efficient, i.e., there can exist estimators with smaller variance.

For these reasons we follow a more direct approach. We extend the Maximum Likelihood (ML)
formulation of [2] to include the occurrence of missing data. The link loss probabilities are then estimated
by the Maximum Likelihood Estimator (MLE) arising from the corresponding likelihood function. In
contrast to the results in [2], it is not generally possible to determine the MLE by simple root finding when
data is missing. Instead, we use the Expectation Maximization (EM) algorithm [7] to generate an
approximating sequence to the corresponding MLE. We now outline the remainder of the paper and the detailed
contributions.

In Section 2 we set up models for the multicast tree, probe propagation, and report loss, and review some
results for loss inference from complete data from [2]. We describe the model frameworks for missing data
and give two examples: inference using unicast stripes [10], and inference using extended RTCP reports,
as proposed in [1]. In Section 3 we set up the incomplete data likelihood function, and describe the EM
algorithm and its application to the present model. We establish conditions required for convergence of the
EM iterates to the MLE. We translate these into conditions on the measured data. If these conditions are
not fulfilled, it is possible to pass instead to one or more related inference problems on subtrees for which
the conditions are fulfilled. In Section 4 we tailor the EM algorithm to our specific problem and present an
algorithm for use on measured data. Section 5 addresses conditions for identifiability of model parameters
and relates these to topological properties of families of subtrees on which complete measurements can be
made. Convergence of the MLE as the number of probes grows is investigated in Section 6; in particular we
obtain explicit expressions for the asymptotic variance of the MLE for a class of simple models. A related
expression for the convergence rate of the EM iterates is obtained in Section 7. The algorithm from Section 4
is evaluated in model-based simulation and using experimentally derived traces in Section 8. We conclude
in Section 9. Details of most proofs are deferred to Section 10.

1.3  Related  work 

Several tools and methodologies exist for characterizing link-level behavior from end-to-end multicast
measurements. However, most of these require complete data from all of the receivers in the multicast tree.
These include the MINC methodologies for loss [2], for delay [15, 9], and for topology characteristics
[3]. These methodologies have been adapted to unicast through the transmission of packet pairs [5] or
stripes [10] to pairs of receivers within a distribution tree. The data then consists of observations from pairs
of receivers and can be interpreted as observations in which the data is missing from all but these pairs of
receivers. The methodology presented in [10] treats the problem as separate problems corresponding to each
pair of receivers and produces link estimates by averaging over all of the estimates produced from each of
these receiver pair problems.

In [5], the authors introduce an additional link parameter, namely the conditional probability that the
second packet within a pair is not lost given that the first packet is not lost. The authors then treat the
outcomes of each of the packets in a pair within the tree as unobserved data and use the EM algorithm
to infer the link probabilities and conditional link probabilities. Due to the complexity of this task, they
propose a heuristic for inferring these parameters. Because we rely on multicast, our task is simplified as
we only have one set of link parameters to infer. Our solution methodology uses the EM algorithm to obtain
a solution to the likelihood equation. Coates and Nowak have extended their EM-based, unicast-based
techniques to infer delay statistics in [6].

Last, there exist several approaches that infer round trip link behavior. These include pathchar [8, 12]
and the linear algebraic approach of [21]. The former infers loss, delay, and available link bandwidth
whereas the latter infers round trip link delays. The former requires considerable time to converge. Both
lose accuracy with asymmetric round trip paths.

2  Models for Probe and Report Transmission

2.1  Tree  model 

Let $T = (V, L)$ denote a logical multicast tree with nodes $V$ and links $L$. We identify one node, the root
$\rho$, with the source of probes, and the set of leaves $R \subset V$ with the set of receivers. We assume that the root
has a single child, denoted by 1. If not, then we can treat separately the trees descended through each child of
$\rho$, each one having this property. Each node $k$, apart from the root $\rho$, has a parent $f(k)$ such that $(f(k), k)$
is a link in $L$. We will sometimes refer to the link $(f(k), k)$ that terminates at $k$ simply as link $k$. Define
recursively the ancestors of $k$ by $f^n(k) = f(f^{n-1}(k))$ with $f^0(k) = k$. We say $j$ is descended from $k$, and
write $j \prec k$, if $k = f^n(j)$ for some $n \in \mathbb{N}$. The set of children of $k$, namely $\{j \in V : f(j) = k\}$, is denoted
by $d(k)$. $T(k) = (V(k), L(k))$ will denote the subtree rooted at $k$; $R(k) = R \cap V(k)$ is the set of receivers
in $T(k)$. Define $U = V \setminus \{\rho\}$.

2.2  Packet  loss  model 

We assume a Bernoulli loss model in which probes are independent and each probe is successfully
transmitted across link $k$ with probability $\alpha_k$. Thus the progress of each probe down the tree is described by an
independent copy of a stochastic process $X = (X_k)_{k \in V}$ as follows. $X_\rho = 1$. $X_k = 1$ if the probe reaches
node $k \in V$ and 0 otherwise. If $X_k = 0$, then $X_j = 0$ for all $j \prec k$. Otherwise, $P[X_j = 1 \mid X_{f(j)} = 1] = \alpha_j$ and
$P[X_j = 0 \mid X_{f(j)} = 1] = 1 - \alpha_j$. We adopt the convention $\alpha_\rho = 1$ and denote $\alpha = (\alpha_i)_{i \in V}$. $P_\alpha$ will denote
the distribution of $X$.
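
To make the loss model concrete, the short Python sketch below draws one realization of the process X. It is an illustration only: the names `simulate_probe` and `receiver_outcome`, the child-to-parent dictionary `parent`, and the use of node 0 as the root are assumptions of this example, and the dictionary is assumed to list each parent before its children.

```python
import random


def simulate_probe(parent, alpha, rng, root=0):
    """Draw one realization of X = (X_k) under the Bernoulli loss model.

    parent: dict child -> parent, with parents listed before their children.
    alpha:  dict node k -> transmission probability of the link ending at k.
    """
    x = {root: 1}  # the probe is always present at the root
    for k, f in parent.items():
        # The probe reaches k only if it reached f(k) and link k passed it on.
        x[k] = 1 if x[f] == 1 and rng.random() < alpha[k] else 0
    return x


def receiver_outcome(x, receivers):
    """Restrict a realization to the receivers, which is all that is observable."""
    return {k: x[k] for k in receivers}


if __name__ == "__main__":
    parent = {1: 0, 2: 1, 3: 1}
    alpha = {1: 0.9, 2: 0.8, 3: 0.7}
    rng = random.Random(1)
    print(receiver_outcome(simulate_probe(parent, alpha, rng), [2, 3]))
```

Because a loss at a node forces zeros at all of its descendants, the simulated receiver outcomes carry the correlation that the inference exploits.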


2.3  Inference  of  link  loss  from  complete  data 


When a probe is sent down the tree from the root $\rho$, we cannot observe the whole process $X$. We assume
that, at most, we know only the outcome $(X_k)_{k \in R} \in \Omega = \{0,1\}^R$ that indicates whether or not the probe
reached each receiver. When the entire outcome for a probe is known (i.e. $X_k$ for all receivers $k$), we
shall say that we have complete data from that probe. In [2] it was shown how the link probabilities can be
determined from the distribution of (complete) outcomes. We briefly review this.

Consider an experiment in which $n$ probes are dispatched from the root $\rho$. Each probe $i = 1, \dots, n$
gives rise to an independent realization $X^{(i)}$ of the probe process $X$. We call

    $X_{cplt} = (X^{(i)}_k)^{i=1,\dots,n}_{k \in R}$    (1)

the complete data for the experiment. For each outcome $x \in \Omega$, let $n(x)$ denote the number of probes
$i = 1, \dots, n$ for which $X^{(i)}_k = x_k$ for all $k \in R$. Let

    $p_\alpha(x) = P_\alpha[X_k = x_k, \ \forall k \in R]$    (2)

denote the probability of an outcome $x \in \Omega$. The log-likelihood of obtaining the complete data
$X_{cplt} = (X^{(1)}, \dots, X^{(n)})$ can be written in terms of the $n(x)$ as

    $\mathcal{L}_c(\alpha) = \log P_\alpha[X_{cplt}] = \sum_{x \in \Omega} n(x) \log p_\alpha(x).$    (3)
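
As a hedged illustration of (2) and (3), the Python sketch below evaluates the outcome probability by a recursion over the tree and then sums the log-probabilities weighted by the counts n(x). The recursion is one straightforward way to marginalize over the unobserved interior nodes and is not taken from the text; the names `children_of`, `outcome_prob` and `complete_loglik`, and the child-to-parent dictionary with root 0, are assumptions of this example.

```python
import math
from collections import Counter


def children_of(parent):
    """Invert the child -> parent map into parent -> list of children."""
    kids = {}
    for k, f in parent.items():
        kids.setdefault(f, []).append(k)
    return kids


def outcome_prob(parent, alpha, x, root=0):
    """p_alpha(x) of (2) for a receiver outcome x (dict receiver -> 0 or 1)."""
    kids = children_of(parent)

    def all_lost(k):
        # True when every receiver at or below k reports 0.
        if k not in kids:
            return x[k] == 0
        return all(all_lost(j) for j in kids[k])

    def below(k):
        # P[observed outcomes at or below k | the probe reached f(k)].
        if k not in kids:
            reached = 1.0 if x[k] == 1 else 0.0
        else:
            reached = math.prod(below(j) for j in kids[k])
        lost = 1.0 if all_lost(k) else 0.0
        return alpha[k] * reached + (1.0 - alpha[k]) * lost

    return math.prod(below(j) for j in kids[root])


def complete_loglik(parent, alpha, outcomes, root=0):
    """Log-likelihood (3): sum over distinct outcomes of n(x) log p_alpha(x)."""
    counts = Counter(tuple(sorted(x.items())) for x in outcomes)
    return sum(n * math.log(outcome_prob(parent, alpha, dict(key), root))
               for key, n in counts.items())


if __name__ == "__main__":
    parent = {1: 0, 2: 1, 3: 1}
    alpha = {1: 0.75, 2: 0.8, 3: 2.0 / 3.0}
    print(outcome_prob(parent, alpha, {2: 0, 3: 0}))
```

An outcome with zero probability under the supplied alpha makes `math.log` raise an error, which is the expected behaviour for parameters on the boundary of the parameter set.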

We characterize the Maximum Likelihood Estimator (MLE) of $\alpha$, namely $\arg\max_{\alpha} \mathcal{L}_c(\alpha)$, as follows.
For $k \in V$, let $A_k(\alpha)$ be the probability that the probe reaches $k$. Thus $A_k = \prod_{j \succeq k} \alpha_j$, where $j$ ranges
over $k$ and its ancestors, i.e., the product of the
probabilities of successful transmission on each link between $k$ and the root $\rho$. For each $k \in U$ set

    $\gamma_k = P_\alpha[\vee_{j \in R(k)} X_j = 1]$    (4)

i.e., $\gamma_k$ is the probability that a probe reaches at least one receiver descended from node $k$. Denote by $\hat\gamma_k$ the
corresponding empirical quantity, i.e., the proportion of the $n$ probes that reach at least one leaf descended
from $k$:

    $\hat\gamma_k = n^{-1} \sum_{i=1}^{n} \vee_{j \in R(k)} X^{(i)}_j.$    (5)

In what follows we consider $\alpha$ to lie in the open parameter set $\mathcal{A} = \{\alpha \mid \alpha_k \in (0,1), \ k \in U\}$. Some of the
results of the following theorem also hold on subsets of the boundary of $\mathcal{A}$.

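Theorem 1 ([2]) Assume $\alpha \in \mathcal{A}$.

(i) For each $k \in U$,

    $(1 - \gamma_k / A_k) = \prod_{j \in d(k)} (1 - \gamma_j / A_k),$    (6)

with the convention that an empty product, which occurs when $k \in R$, is zero.

(ii) Let $\mathcal{G} = \{(\gamma_k)_{k \in U} : \gamma_k > 0 \ \forall k; \ \gamma_k < \sum_{j \in d(k)} \gamma_j \ \forall k \in U \setminus R\}$. For each $\gamma \in \mathcal{G}$ and $k \in U$, (6) has
a unique solution $A_k(\gamma)$ in the interval $(\gamma_k, 1)$.

(iii) When $\hat\gamma \in \mathcal{G}$ the likelihood equation

    $\frac{\partial \mathcal{L}_c}{\partial \alpha_k}(\alpha) = 0, \quad k \in U$    (7)

has as a unique solution

    $\hat\alpha_k := A_k(\hat\gamma) / A_{f(k)}(\hat\gamma), \quad k \in U.$    (8)

(iv) With probability one, for sufficiently large $n$, both $\hat\alpha$ and the MLE of $\alpha$ lie in $\mathcal{A}$, and are hence equal.

(v) The parameters $\alpha$ are identifiable, i.e., $P_\alpha = P_{\alpha'}$ for $\alpha, \alpha' \in \mathcal{A}$ implies $\alpha = \alpha'$.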
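
Equations (6) and (8) suggest a direct numerical recipe: solve (6) for each node and take ratios along each link. The Python sketch below does this by bisection. It is a sketch under assumptions rather than the method of [2]: the function names `solve_A` and `link_estimates` are hypothetical, node 0 is taken as the root with probability 1 of being reached, and the code simply refuses to proceed when it finds no sign change on the interval, which is how it signals inputs that appear to lie outside the set on which the theorem guarantees a solution.

```python
def solve_A(gamma_k, child_gammas, tol=1e-12):
    """Solve (6) for A_k on (gamma_k, 1) by bisection.

    For a receiver (no children) the convention in (6) gives A_k = gamma_k.
    """
    if not child_gammas:
        return gamma_k

    def h(a):
        # Left side minus right side of (6), as a function of the unknown A_k.
        prod = 1.0
        for g in child_gammas:
            prod *= 1.0 - g / a
        return (1.0 - gamma_k / a) - prod

    lo, hi = gamma_k, 1.0
    if not (h(lo) < 0.0 < h(hi)):
        raise ValueError("no sign change on (gamma_k, 1) for these inputs")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def link_estimates(parent, gamma, root=0):
    """Link estimates (8): ratio of A_k to A_f(k) for every non-root node."""
    kids = {}
    for k, f in parent.items():
        kids.setdefault(f, []).append(k)
    A = {root: 1.0}
    for k in parent:
        A[k] = solve_A(gamma[k], [gamma[j] for j in kids.get(k, [])])
    return {k: A[k] / A[parent[k]] for k in parent}


if __name__ == "__main__":
    parent = {1: 0, 2: 1, 3: 1}
    gamma_hat = {1: 0.7, 2: 0.6, 3: 0.5}
    print(link_estimates(parent, gamma_hat))
```

For a node with two children, (6) reduces to a linear equation in A_k, so the bisection result can be checked by hand: with the values above, A_1 = 0.6 × 0.5 / (0.6 + 0.5 − 0.7) = 0.75.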

It turns out that Theorem 1(iv) is weaker than required for the present paper. We now establish a stronger
version that provides a test as to whether or not $\hat\alpha$ is the MLE for finite $n$.

Theorem 2 Assume $\hat\gamma \in \mathcal{G}$. If $\hat\alpha \in \mathcal{A}$ then it is the MLE for $\alpha$.

Remark: Theorem 1(iv) establishes that for $n$ sufficiently large, the MLE lies in $\mathcal{A}$ and hence must be $\hat\alpha$,
the solution of the likelihood equation. Theorem 2 is more useful from the computational point of view; it
says that provided $\hat\alpha$ lies in $\mathcal{A}$, a condition that can be checked by inspection, it is the MLE, regardless of $n$.

As a consequence of the MLE property, $\hat\alpha$ is consistent ($\hat\alpha \to \alpha$ as $n \to \infty$, with probability 1), and
asymptotically normal ($\sqrt{n}(\hat\alpha - \alpha)$ converges in distribution to a multivariate Gaussian random variable as $n \to \infty$);
see e.g. [19].

2.4  Missing data model

We now want to generalize the problem by admitting the possibility that some outcomes may not be
completely known because the receiver variables are missing. Let $T = (T^{(i)}_k)^{i=1,\dots,n}_{k \in R}$ denote the $n \times \#R$ matrix
of missing data indicators, with $T^{(i)}_k$ taking the value 0 if the variable $X^{(i)}_k$ is missing, and $T^{(i)}_k = 1$ if it is
present. The sets of observed data and missing data are thus, respectively,

    $X_{obs} = \{X^{(i)}_k \mid T^{(i)}_k = 1\}$ and $X_{mis} = \{X^{(i)}_k \mid T^{(i)}_k = 0\}.$    (9)

In this paper we assume that the missing data mechanism is ignorable in a sense we now describe;
see [14] for further details. We treat $T$ as a random variable whose distribution is parameterized by some
quantity $\theta$. $P_\theta$ will denote the distribution of $T$ under $\theta$, and $P_{\theta,\alpha}$ the joint distribution of $X_{cplt}$ and $T$.

We henceforth assume that the missing data is missing at random (MAR). This is the property that the
distribution of the missing-data mechanism $T$ does not depend on the missing values $X_{mis}$. More formally,
we can write the MAR property as $P_\theta[T \mid X_{obs}, X_{mis}] = P_\theta[T \mid X_{obs}]$. As a consequence of MAR it can be
shown that the joint distribution of the observed data and the missing-data mechanism enjoys the following
factorization property:

    $P_{\theta,\alpha}[X_{obs}, T] = P_\theta[T \mid X_{obs}] \, P_\alpha[X_{obs}].$    (10)

Assuming the parameters $(\alpha, \theta)$ to be distinct with product parameter space $\mathcal{A} \times \Theta$, (10) says that the
missing data mechanism is ignorable in that likelihood-based inferences for $\alpha$ based on the joint likelihood
$P_{\theta,\alpha}[X_{obs}, T]$ are the same as those based upon $P_\alpha[X_{obs}]$. Thus for purposes of inferring $\alpha$, we may ignore
the parameters $\theta$ of the missing data mechanism. A special case of MAR is data missing completely at
random (MCAR). With MCAR the missingness probabilities do not depend on any data: $P_\theta[T \mid X_{obs}] = P_\theta[T]$.

2.5  Examples 

We describe two applications where data is missing and place them into the framework described above.

Inference using unicast data. In [10], the authors describe an approach to unicast based inference in
which $n$ sets of packets, known as stripes, are transmitted by a source to all receiver pairs. The motivation
is that within each stripe, packets are transmitted back-to-back, and so their loss behavior on common links
should be highly correlated. With perfect correlations (i.e. both packets being either transmitted or lost on
a common link) the stripe has the same behavior as a notional multicast packet that follows the same route
and is subject to the same loss.

We can put each receiver pair in correspondence with a missing data indicator as follows. $T^{(i)} = (T^{(i)}_k)_{k \in R}$
identifies the pair of receivers corresponding to the $i$-th stripe, i.e., $T^{(i)}_j = T^{(i)}_k = 1$, $T^{(i)}_l = 0$,
$l \in R$, $l \ne j, k$ if the pair of receivers is $j, k \in R$, $j \ne k$. Thus missingness of data from probe $i$ at receiver
$l$ occurs because $l$ is not a member of the pair of receiver nodes selected for the probe.

If the receiver pairs are chosen independently from stripe to stripe using the same distribution, then
$T = (T^{(i)})_{i=1,\dots,n}$ is a sequence of IID random variables. Thus $T$ has the following distribution,

    $P[T = t] = \prod_{i=1}^{n} P[T^{(i)} = t^{(i)}], \quad \forall t^{(i)} \in \{0,1\}^{\#R}$    (11)

where

    $P[T^{(i)} = t] = \sum_{j,k \in R;\, j \ne k} p_{j,k} \, 1\{t_j = 1\} \, 1\{t_k = 1\} \prod_{l \in R \setminus \{j,k\}} 1\{t_l = 0\}, \quad \forall t \in \{0,1\}^{\#R}.$

Here $p_{j,k}$ is the probability that the pair of receivers $j$ and $k$ is chosen. If we further assume that $T$ is
independent of $X$, then the data is MCAR.

Another variation has the sender cycle through the pairs in a round robin fashion. Let $\pi : R \times R \setminus \{(j,j) : j \in R\} \to \{1, \dots, m\}$
be a one-to-one mapping where $\pi(j,k)$ is the position in the round robin schedule
where a probe is sent to receiver pair $j$ and $k$, and $m = \#R \times (\#R - 1)$. The joint probability distribution
for $T$ is given by (11) with

    $P[T^{(im+d)} = t] = \sum_{j,k \in R;\, j \ne k} 1\{\pi(j,k) = d\} \, 1\{t_j = 1\} \, 1\{t_k = 1\} \prod_{l \in R \setminus \{j,k\}} 1\{t_l = 0\}$    (12)

for all $t \in \{0,1\}^{\#R}$, $i \ge 0$, $1 \le d \le m$.
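
The round robin variant can be illustrated with a few lines of Python that generate the rows of the missing data indicator matrix. This is a sketch only: the function name `round_robin_indicators` is hypothetical, the schedule order produced by `itertools.permutations` is just one admissible choice of the one-to-one mapping, and probes are indexed from 0 here, so row `i` corresponds to schedule position `(i mod m) + 1` in the notation of the text.

```python
from itertools import permutations


def round_robin_indicators(receivers, n):
    """Rows of missing data indicators for n stripes sent round robin to pairs.

    Each row maps receiver -> 1 if it belongs to the pair probed by that stripe,
    and 0 otherwise (its datum is missing for that probe).
    """
    # Ordered pairs (j, k) with j != k, so the period is m = #R * (#R - 1).
    schedule = list(permutations(receivers, 2))
    rows = []
    for i in range(n):
        j, k = schedule[i % len(schedule)]
        rows.append({r: 1 if r in (j, k) else 0 for r in receivers})
    return rows


if __name__ == "__main__":
    for row in round_robin_indicators([2, 3, 4], 6):
        print(row)
```

Since the schedule is deterministic it is trivially independent of the probe process, so the resulting data is missing completely at random.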

Inference using RTP/RTCP. The Real-time Transport Protocol (RTP) [20] is a protocol for the transfer
of data from a single sender to one or more receivers. Associated with it is a control protocol RTCP that
allows receivers to broadcast loss behavior to each other and to a third party. Typically, the observations
are batched and each batch is broadcast as a single report. The third party can collect the observations
and apply inference methodologies to them. However, these reports are typically not transmitted reliably.
Consequently, the data collector must deal with missing information.

In the current implementation of RTCP, receivers broadcast only average loss rates. Extensions to the
protocol, proposed in [1], enable receivers to report on the reception of individual packets. However, due
to the constraints imposed on reporting volumes by RTCP, it may not be possible to report on every packet.
The omission of certain reports to fulfill this constraint is thus an additional source of missingness.

We propose a simple model for this scenario. Consider receiver $j \in R$ that collects loss observations
and sends them to a data collector. Let $\{\{A_{i,j}\}_{i=1}^{\infty}\}_{j \in R}$ be a set of random variables where $A_{i,j}$ is the
number of observations placed in the $i$-th report by the $j$-th receiver. Let $\{\{C_{k,j}\}_{k=1}^{\infty}\}_{j \in R}$ be indicator
random variables that represent the outcome of the transmission of the $k$-th loss report by receiver $j$ to the
data collector; it is received by the collector if $C_{k,j} = 1$ and lost otherwise. Define
$\pi(i,j) = \min\{l : \sum_{k=1}^{l-1} A_{k,j} < i \le \sum_{k=1}^{l} A_{k,j}\}$, i.e., $\pi(i,j)$ identifies which report the $i$-th observation from receiver $j$ is
placed in. Let $\{\{S_{i,j}\}_{i=1}^{n}\}_{j \in R}$ be a set of indicator random variables that represent whether probe $i$ was
actually selected for reporting from receiver $j$; it is selected if $S_{i,j} = 1$. Then the missing data indicator
$T^{(i)}_j$, $i = 1, \dots, n$, $j \in R$ can be expressed in terms of the variables $A$, $S$ and $C$: the datum from probe $i$ at
receiver $j$ is present only if the probe was selected for reporting and the report carrying it was received.

Under strong simplifying assumptions, namely that the random variables $A$, $S$ and $C$ are independent of
$X$, the model is MCAR. However we can posit a situation in which independence may not hold in practice.
Suppose the collector lies at a node $k$ in the multicast tree. Then the path for reports from receivers in
$R \setminus R(k)$ to $k$ intersects with the paths of probe packets from $\rho$ to receivers in $R(k)$. Thus we may expect
the missingness variables $\{T_j : j \in R \setminus R(k)\}$ to be correlated with the receiver state $\{X_j : j \in R(k)\}$.
This is precisely the type of model allowed when data is MAR.

2.6  Approaches  to  the  prohlem  of  missing  data 

It  is  tempting  to  reduee  the  problem  of  inferenee  with  missing  data  to  a  eomposite  of  known  inferenees  by 
performing  inferenee  using  subsets  of  probes  for  whieh  reports  reaehed  leaf  deseendents  of  a  given  node. 
A  simple  approaeh  to  manage  the  impaet  of  missing  data  is  to  restriet  inferenee  to  subsets  of  probes  and 
reeeivers  for  whieh  eomplete  data  is  available,  then  pateh  together  sueh  estimators  to  draw  inferenee  on  the 
eomplete  tree.  A  minimal  way  to  do  this  would  be  to  use  only  probes  for  whieh  reports  were  reeeived  from 
all  reeeivers.  A  more  sophistieated  approaeh  is  the  following: 

(a)  For  eaeh  k  E  R,  estimate  =  %  by  the  fraetion  of  observed  reports  indieating  probe  reeeption. 

(b)  For  eaeh  k  E  U\R\et  TZj^  denote  the  set  of  subsets  of  R{k)  in  whieh  eaeh  member  is  the  deseendant 

of  a  different  ehild  of  k.  For  eaeh  r  G  7^,  use  only  probes  with  reports  reeeived  from  all  j  G  r  to 
form  the  fraetions  %{r)  and  Estimate  Ak  =  {Y.re'Rk'^k{l{r))/#Rk,  i-e.,  averaging 

over  the  r  G  TZ^- 

(e)  Estimate  link  transmission  probabilities  =  AklAf(^p^. 

However, such “patchwork” approaches have three pitfalls:

(i) Unless the coverage is sufficiently rich, it is not possible to infer transmission probabilities for all
links. If a node is not a branch point of some “complete data” subtree, it follows from one of our
later results that one cannot infer the transmission probability for the link that terminates at that node.
In the minimal case, there may be no probes for which reports are received from all receivers.

(ii) Such estimators are not consistent unless data is MCAR; we illustrate with an example in Section 3.1.
Furthermore, checking whether a given data set is consistent with the MCAR property may be a
complex task since the number of consistency conditions that would have to be checked grows
exponentially with the number of leaves in the tree.

(iii) Even under MCAR such estimators are not generally efficient, i.e., there can exist estimators with
smaller variance.

For these reasons we instead extend the previous ML approach to cover the missing data case directly:
under general conditions ML estimators are consistent and efficient. This is the subject of the next section.

3  Estimation of Link Loss Rates with Incomplete Data


In this section we present the likelihood function $\mathcal{L}$ for the incomplete data. Determination of the
corresponding ML estimator for the link probabilities turns out to be significantly more complex than in the
complete data case. We turn to a standard iterative method, the EM algorithm, to derive a sequence that
approximates the incomplete data MLE.


3.1  Description  of  incomplete  data  and  the  likelihood  function 


The corresponding incomplete data likelihood function is the marginal distribution function of the observed
data; formally we write this as $\int P_{\alpha,\theta}[X_{\mathrm{obs}}, X_{\mathrm{mis}}, T]\, dX_{\mathrm{mis}}$. We now obtain an explicit expression. In order
to represent both missing and observed data in a compact form, we extend the set of outcomes to the set
$\Omega^* = \{0,1,u\}^R$, where $u$ is used to denote that a given receiver datum is missing. $u^* = (u,\dots,u) \in \Omega^*$
will denote the outcome in which data is missing from all receivers. Let $t \in \{0,1\}^R$ denote the generic
vector of missing data indicator variables. With each such $t$ and $x \in \Omega$ we then associate an element $x(t)$ of
$\Omega^*$ through

$$x_k(t) = \begin{cases} u & \text{if } t_k = 0, \\ x_k & \text{otherwise,} \end{cases} \qquad k \in R. \tag{13}$$

An inverse of the above map associates with $x^* \in \Omega^*$ its missing data indicator $t(x^*)$ by

$$t_k(x^*) = \begin{cases} 0 & \text{if } x^*_k = u, \\ 1 & \text{otherwise,} \end{cases} \qquad k \in R. \tag{14}$$

The set of complete outcomes $x$ that can give rise to an incomplete outcome $x^*$ is the set

$$\Omega(x^*) = \{x \in \Omega \mid x^*_k = x_k \text{ whenever } t_k(x^*) = 1\}, \quad\text{and conversely}\quad \Omega^*(x) = \{x^* \in \Omega^* \mid x \in \Omega(x^*)\} \tag{15}$$

is the set of incomplete outcomes $x^*$ that can be obtained from a complete outcome $x$. The equivalent
conditions $x \in \Omega(x^*)$ and $x^* \in \Omega^*(x)$ can be rewritten as $x(t(x^*)) = x^*$.
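
The maps in (13) to (15) can be illustrated with a minimal Python sketch. The function names `mask`, `indicator` and `compatible`, and the use of the string "u" for a missing datum, are choices made here for illustration and do not come from the report.

```python
from itertools import product

U = "u"  # marker for a missing receiver datum (illustrative choice)


def mask(x, t):
    """x(t) of (13): hide the entries of the complete outcome x where t_k = 0."""
    return tuple(xk if tk == 1 else U for xk, tk in zip(x, t))


def indicator(x_star):
    """t(x*) of (14): recover the missing data indicator from x*."""
    return tuple(0 if xk == U else 1 for xk in x_star)


def compatible(x_star):
    """Omega(x*) of (15): complete outcomes agreeing with x* wherever it is observed."""
    choices = [(0, 1) if xk == U else (xk,) for xk in x_star]
    return set(product(*choices))
```

For instance, `mask((1, 0), (1, 0))` hides the second receiver, and every element `x` of `compatible(x_star)` satisfies `mask(x, indicator(x_star)) == x_star`, which is the identity $x(t(x^*)) = x^*$ stated above.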

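The probability of recording an incomplete outcome $X^{(i)}(T^{(i)}) = x^*$ is denoted

$$q_{\alpha,\theta}(x^*) = P_{\alpha,\theta}[X^{(i)}(T^{(i)}) = x^*]. \tag{16}$$

Now $\{X^{(i)}(T^{(i)}) = x^*\} = \{X^{(i)} \in \Omega(x^*)\} \cap \{T^{(i)} = t(x^*)\}$. Using the MAR property (10) we factorize

$$q_{\alpha,\theta}(x^*) = p^*_\alpha(x^*)\,\theta(x^*) \tag{17}$$

where

$$p^*_\alpha(x^*) = P_\alpha[X^{(i)} \in \Omega(x^*)] = \sum_{x \in \Omega(x^*)} p_\alpha(x) \quad\text{and}\quad \theta(x^*) = P_\theta[T^{(i)} = t(x^*) \mid X^{(i)} \in \Omega(x^*)]. \tag{18}$$

Without loss of generality we have taken the missingness probabilities themselves as parameters $\theta$. Note that
by the MAR property, for any $x \in \Omega(x^*)$, $P_\theta[T^{(i)} = t(x^*) \mid X^{(i)} \in \Omega(x^*)] = P_\theta[T^{(i)} = t(x^*) \mid X^{(i)} = x]$.

Since $\{t(x^*) \mid x^* \in \Omega^*(x)\} = \{0,1\}^R$, the conditional probabilities $\theta$ satisfy

$$\sum_{x^* \in \Omega^*(x)} \theta(x^*) = 1, \qquad \forall x \in \Omega. \tag{19}$$

Now let $m(x^*)$ denote the number of probes $i = 1,\dots,n$ for which $X^{(i)}(T^{(i)}) = x^*$. Due to the
factorization property (10), the log-likelihood function $\log \prod_{i=1}^{n} q_{\alpha,\theta}(X^{(i)}(T^{(i)}))$ can be written as a sum of

$$\mathcal{L}(\alpha) = \sum_{x^* \in \Omega^*} m(x^*) \log p^*_\alpha(x^*), \tag{20}$$

with a term that is independent of $\alpha$. Thus, for the purposes of obtaining an ML estimate of $\alpha$, we need only
consider $\mathcal{L}(\alpha)$. We refer to $\mathcal{L}$ as the incomplete data likelihood function. Note that the term in $m(u^*)$ makes
no contribution to $\mathcal{L}$ since $\Omega(u^*) = \Omega$ and hence $p^*_\alpha(u^*) = 1$. Hence the sum in (20) can be restricted to
$\Omega^*_0 = \Omega^* \setminus \{u^*\}$.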


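Example: 2 leaf tree with MAR data. We now give an example to show how the complete data MLE,
applied to those probes for which complete data is available, generally produces an inconsistent estimate of
the link probabilities in the MAR case. Consider a two leaf tree where data is MAR from the right leaf; the
probability of missingness thus depends on the data observed at the left leaf. Writing $\bar\alpha_k = 1 - \alpha_k$, the leaf
probabilities obey:

$$q(11) = \alpha_1\alpha_2\alpha_3\,\theta(11), \quad q(10) = \alpha_1\alpha_2\bar\alpha_3\,\theta(10), \quad q(01) = \alpha_1\bar\alpha_2\alpha_3\,\theta(01), \tag{21}$$

$$q(00) = (\bar\alpha_1 + \alpha_1\bar\alpha_2\bar\alpha_3)\,\theta(00), \quad q(1u) = \alpha_1\alpha_2\,\theta(1u), \quad q(0u) = (1 - \alpha_1\alpha_2)\,\theta(0u). \tag{22}$$

Using the four instances of (19), namely,

$$x = 11:\ \theta(11) + \theta(1u) = 1; \qquad x = 10:\ \theta(10) + \theta(1u) = 1; \tag{23}$$

$$x = 01:\ \theta(01) + \theta(0u) = 1; \qquad x = 00:\ \theta(00) + \theta(0u) = 1; \tag{24}$$

(21) and (22) reduce, with $\bar\theta = 1 - \theta$, to

$$q(11) = \alpha_1\alpha_2\alpha_3\,\theta(11), \quad q(10) = \alpha_1\alpha_2\bar\alpha_3\,\theta(11), \quad q(01) = \alpha_1\bar\alpha_2\alpha_3\,\theta(01), \tag{25}$$

$$q(00) = (\bar\alpha_1 + \alpha_1\bar\alpha_2\bar\alpha_3)\,\theta(01), \quad q(1u) = \alpha_1\alpha_2\,\bar\theta(11), \quad q(0u) = (1 - \alpha_1\alpha_2)\,\bar\theta(01). \tag{26}$$

Now, the complete data MLE based on the corresponding complete data empirical probabilities
$\hat q(11), \hat q(10), \hat q(01), \hat q(00)$ is

$$\hat\alpha_1 = \frac{(\hat q(11) + \hat q(10))(\hat q(11) + \hat q(01))}{\hat q(11)\,(\hat q(11) + \hat q(10) + \hat q(01) + \hat q(00))}, \quad \hat\alpha_2 = \frac{\hat q(11)}{\hat q(11) + \hat q(01)}, \quad \hat\alpha_3 = \frac{\hat q(11)}{\hat q(11) + \hat q(10)}. \tag{27}$$

In the MCAR case, all $\theta(x^*)$ appearing in (25) would be equal, and substituting $q$ for $\hat q$ in (27) one recovers
$\alpha_1, \alpha_2, \alpha_3$: the estimator is consistent. But in the general MAR case one obtains only

$$\hat\alpha_1 = \frac{\alpha_1(\alpha_2\theta(11) + \bar\alpha_2\theta(01))}{\alpha_1\alpha_2\theta(11) + (1 - \alpha_1\alpha_2)\theta(01)}, \quad \hat\alpha_2 = \frac{\alpha_2\theta(11)}{\alpha_2\theta(11) + \bar\alpha_2\theta(01)}, \quad \hat\alpha_3 = \alpha_3, \tag{28}$$

i.e., only estimation of $\alpha_3$ is consistent.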
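
The inconsistency in this example can be checked numerically. The following Python sketch evaluates the complete data MLE on the limiting complete-case probabilities $q(11), q(10), q(01), q(00)$ of the two leaf tree. The function names and the numerical values chosen for $\alpha$ and $\theta$ are illustrative assumptions, not values from the report.

```python
def complete_case_probs(alpha, theta11, theta01):
    """Limiting probabilities of the fully observed outcomes in the two leaf MAR example.

    theta11 is the report probability when the left leaf shows 1, theta01 when it shows 0.
    """
    a1, a2, a3 = alpha
    b1, b2, b3 = 1 - a1, 1 - a2, 1 - a3
    return {
        "11": a1 * a2 * a3 * theta11,
        "10": a1 * a2 * b3 * theta11,
        "01": a1 * b2 * a3 * theta01,
        "00": (b1 + a1 * b2 * b3) * theta01,
    }


def complete_data_mle(q):
    """Complete data MLE of (alpha_1, alpha_2, alpha_3) from the four complete-case frequencies."""
    total = q["11"] + q["10"] + q["01"] + q["00"]
    a1 = (q["11"] + q["10"]) * (q["11"] + q["01"]) / (q["11"] * total)
    a2 = q["11"] / (q["11"] + q["01"])
    a3 = q["11"] / (q["11"] + q["10"])
    return a1, a2, a3


alpha = (0.9, 0.8, 0.7)                                            # hypothetical link probabilities
mcar = complete_data_mle(complete_case_probs(alpha, 0.6, 0.6))     # equal theta: MCAR
mar = complete_data_mle(complete_case_probs(alpha, 0.9, 0.5))      # theta depends on left leaf: MAR
```

With equal missingness probabilities the estimator returns $\alpha$ up to rounding, while with unequal ones only the third component agrees with the true value.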

3.2  Application of the EM algorithm

We can in principle estimate the link probabilities $\alpha$ by the incomplete data MLE $\hat\alpha = \arg\max_\alpha \mathcal{L}(\alpha)$ in
(20), calculated from the counts of incomplete outcomes $m = \{m(x^*) : x^* \in \Omega^*_0\}$. However, we have been
unable to obtain a direct solution to the incomplete data likelihood equation. Instead, we employ a standard
statistical method, the Expectation-Maximization (EM) algorithm [7], to obtain an iterative approximation
$\hat\alpha^{(\ell)}$, $\ell = 0, 1, \dots$, to a stationary value of the incomplete data likelihood. The algorithm comprises the
following steps:

(i) Initialization. Pick some initial link probabilities $\hat\alpha^{(0)}$. This could be done, e.g., by setting $\hat\alpha^{(0)}$ to
the complete data MLE determined from the counts of complete outcomes in $m$ if these are non-zero.
When complete data is not available, we can use the fact (see the proof of Theorem 5 in [2]) that
$\bar A_k = \bar\gamma_k + O(\|\bar\alpha\|^2)$ to approximate $\alpha_k = A_k / A_{f(k)} \approx (1 - \bar\gamma_k)/(1 - \bar\gamma_{f(k)}) \approx 1 + \gamma_k - \gamma_{f(k)}$.
This suggests the initial value

$$\hat\alpha^{(0)}_k = 1 + \hat\gamma_k - \hat\gamma_{f(k)}. \tag{29}$$

(ii) Expectation. For each $\alpha'$ find the conditional expectation of the complete data log-likelihood given the
incomplete data: $Q(\alpha', \hat\alpha^{(\ell)}) = E_{\hat\alpha^{(\ell)}}[\mathcal{L}_c(\alpha') \mid m]$.

(iii) Maximization. Find the maximizer of the conditional expectation: $\hat\alpha^{(\ell+1)} = \arg\max_{\alpha'} Q(\alpha', \hat\alpha^{(\ell)})$.

(iv) Iteration. Iterate steps (ii) and (iii) until some termination criterion is satisfied.

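For $k \in V$, define the conditional probabilities for a probe to reach $R(k)$ as

$$\hat\gamma_{k,\alpha} = P_\alpha\Big[\bigvee_{j \in R(k)} X_j = 1 \,\Big|\, m\Big]. \tag{30}$$

For notational convenience we write the conditional probability $\hat\gamma_{\hat\alpha^{(\ell)}}$ derived from the iterate $\hat\alpha^{(\ell)}$ as $\hat\gamma^{(\ell)}$.

Theorem 3 Assume $\hat\gamma^{(\ell)} \in \mathcal{G}$. Then

$$\hat\alpha^{(\ell+1)} = \mathcal{K}(\hat\gamma^{(\ell)}) \tag{31}$$

provided that $\mathcal{K}(\hat\gamma^{(\ell)})$ lies in $\mathcal{A}$.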

We now investigate the question of convergence of the iterates $\hat\alpha^{(\ell)}$. Whereas the complete data
likelihood function can be shown to derive from a standard exponential family (see the proof of Theorem 2),
the incomplete data likelihood function derives only from a curved exponential family. Thus we cannot use
results based on standard exponential families (see e.g. [22]) alone to conclude convergence of $\hat\alpha^{(\ell)}$ to $\hat\alpha$. We
now establish conditions under which the sequence exists in $\mathcal{A}$ and converges to the MLE for the incomplete
data problem.

Theorem 4 Assume $\hat\gamma^{(\ell)} \in \mathcal{G}$ and $\hat\alpha^{(\ell)} \in \mathcal{A}$ for all $\ell$.

(i) $\mathcal{L}(\hat\alpha^{(\ell)})$ converges to some limit $L$.

(ii) If $\{\alpha \in \mathcal{A} \mid \mathcal{L}(\alpha) = L\}$ is discrete, $\hat\alpha^{(\ell)}$ converges to some $\alpha^*$ that is a stationary point of $\mathcal{L}$, i.e.,
$\partial\mathcal{L}/\partial\alpha_k\,(\alpha^*) = 0$ for all $k$.

(iii) If $\mathcal{L}$ is unimodal, $\hat\alpha^{(\ell)}$ converges to the incomplete data MLE $\hat\alpha$.

3.3  Calculation  of  the  EM  iterates 

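An algorithm for calculating $\mathcal{K}(\gamma)$ for a given $\gamma \in \mathcal{G}$ has been detailed in [2]. It remains then to provide
an algorithm for the calculation of the $\hat\gamma_{k,\alpha}$. Let $n_0 = n - m(u^*)$ denote the number of probes for which the
data is not entirely missing. Observe that

$$\hat\gamma_{k,\alpha} = \frac{1}{n_0} \sum_{x^* \in \Omega^*_0} m(x^*)\, \hat\gamma_{k,\alpha}(x^*) \tag{32}$$

where

$$\hat\gamma_{k,\alpha}(x^*) = P_\alpha\Big[\bigvee_{j \in R(k)} X_j = 1 \,\Big|\, X^* = x^*\Big]. \tag{33}$$

Let $R(k, x^*) = \{j \in R(k) \mid t_j(x^*) = 1\}$ denote the receivers descended from $k$ from which data is
observable. Let $h(k, x^*)$ denote the closest ancestor $h$ of $k$ for which a packet has been observed to reach at
least one descendant leaf, i.e.,

$$h(k, x^*) = \inf\{j \succeq k : y^*_j = 1\}, \tag{34}$$

where $y^*_j$ is the extended maximum of the $x^*_i$ over the receivers $i \in R(j)$ (the extended maximum is
defined in Section 3.4). When $k \prec j$, let $d(j, k)$ denote that child of $j$ that is an ancestor of (or possibly equal
to) $k$, i.e., $d(j, k) = \{i \in d(j) : i \succeq k\}$.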


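Theorem 5 When $y^*_k = 1$, $\hat\gamma_{k,\alpha}(x^*) = 1$. When $y^*_k < 1$,

$$\hat\gamma_{k,\alpha}(x^*) = \frac{c_k - b_k}{c_{d(h,k)}} \prod_{k \prec i \preceq d(h,k)} \Big( \alpha_i \prod_{j \in d(i) \setminus \{d(i,k)\}} c_j \Big) \tag{35}$$

where $h = h(k, x^*)$, and for $k \prec h$,

$$b_k = P_\alpha\Big[\bigvee_{j \in R(k)} X_j = 0 \,\Big|\, X_{f(k)} = 1\Big], \qquad c_k = \begin{cases} 1 & \text{if } R(k, x^*) = \emptyset, \\ P_\alpha\Big[\bigvee_{j \in R(k,x^*)} X_j = 0 \,\Big|\, X_{f(k)} = 1\Big] & \text{otherwise.} \end{cases} \tag{36}$$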


Remark: it was found in [2] that the problem of determining the $\alpha$ for a tree with complete data factors into
the problem of solving a set of depth two tree inference problems, one for each node $k \in V \setminus R$. For each
such $k$ one constructs the logical tree with root $\rho$ having single child $k$, and $d(k)$ leaf-children. Furthermore,
for a general tree, the problem could be mapped onto that for a binary tree by the insertion of lossless links.
However, this method cannot be applied when there is missing data. This is because the form (35) for
$\hat\gamma_{k,\alpha}(x^*)$ includes variables from receivers other than those descended from $k$.

3.4  Topology and data conditions

Theorems 3 and 4 required the iterates $\hat\gamma^{(\ell)}$ to lie in the domain $\mathcal{G}$. In this section we specify conditions
on the data in order for these requirements to hold. In some cases where the conditions do not hold, it is
possible to adjust the problem by passing to one or more subtrees of the original tree, for which the conditions
do hold. The requirements for Theorems 3 and 4 are then fulfilled: see Lemma 1 below. The conditions
described here are similar to those applied in the case of complete data in [2].

Non-identifiable subtrees. Order the elements of the set $\{0, 1, u\}$ as $u < 0 < 1$ and extend the usual
maximum operator $\vee$ on $\{0, 1\}$ to an operation $\vee^*$ on $\{0, 1, u\}$, respecting the order in the obvious manner.
For a given realization $(X, T)$ of the single probe and missing data process, define the quantities

$$Y^*_k = {\bigvee}^*_{j \in R(k)} X_j(T), \tag{37}$$

i.e., the extended maximum of $X_j(T)$ over all receivers $j$ descended from $k$. $Y^*_k$ takes the value $u$ if all data
from $R(k)$ on a given probe is missing, 1 if a probe was observed to reach at least one receiver in $R(k)$, and
0 otherwise. We first eliminate from consideration subtrees on which no data is missing but whose leaves
were reached by no probes. For $k \in V$, let $D_k$ denote the event that for some probe $i$, $X^{(i)}_j(T^{(i)}) \ne 0$ for
some $j \in R(k)$. We will assume

$$D_k \text{ occurs for all } k \in V. \tag{38}$$

If (38) does not hold, the following procedure reduces the inference problem to one for which it does. If $D_k$
fails, we remove from further consideration the subtree $T(k)$ rooted at $k$. If this pruning leaves the parent
$f(k)$ with only one offspring $j$, the remaining tree is no longer a logical multicast tree. To make it so we
remove the link $(f(k), j)$ and identify the nodes $j$ and $f(k)$. The consequence is that we will only be able to
identify the characteristics of the composite link joining $j$ to $f(f(j))$ in the original tree. Performing these
operations for all $k$ at which $D_k$ fails, we obtain a tree for which (38) holds.

In general, it is not possible to attribute a transmission probability, even of zero, to individual links in
$T(k)$, since we cannot distinguish the link or links with zero transmission rate. An exception to this is when
$D_k$ fails for a leaf node $k$, but $D_{f(k)}$ holds at the parent node $f(k)$. In this case we may estimate $\hat\alpha_k = 0$.
Except in this case, we flag all $\{\alpha_j : j \preceq k\}$ as unknown.

Links with perfect transmission. Let $D'_k$ denote the complement of the event $\{X^{(i)}_j(T^{(i)}) = 1,\ \forall j \in R(k),\ i = 1, \dots, n\}$.
When $D'_k$ fails, lossless transmission is reported for all probes to all receivers in $R(k)$.
The effect is to position $\mathcal{K}(\hat\gamma)$ on the boundary of $\mathcal{A}$, since it follows that $\mathcal{K}_j(\hat\gamma_\alpha) = 1$ for all $j \in R(k)$.
Although this is not a problem for computation, it takes us out of the domain of application of Theorems 2,
3 and 4. The formal application of these results requires that

$$D'_k \text{ holds for all } k \in V. \tag{39}$$

If (39) does not hold, the following procedure reduces the inference problem to a set of one or more for
which it does. When $D'_k$ fails, we set $\hat\alpha_j = 1$ for all nodes $j \in V(k)$, and omit these nodes from further
consideration. We then spawn a set of separate inference problems by forming the set of subtrees not
containing $k$ that are rooted at ancestors of $k$. This is the set of trees $\{T(j, i) \mid j \succ k;\ i \in d(j),\ i \not\succeq k\}$,
where $T(j, i)$ has vertices $\{j\} \cup V(i)$ and links $\{(j, i)\} \cup L(i)$.


Model Conformance. We also need a condition to ensure that estimated quantities $\hat\gamma$ lie in $\mathcal{G}$. Let $D''_k$
be the event that $k$ has children $j, i \in d(k)$ such that, for some probe, $Y^*_j = 1$ and $Y^*_i \ne 0$. Without this
condition, probe losses on different subtrees descended from $k$, conditional on the probe having reached $k$,
are correlated. This is because each probe is observed on no more than one such subtree. Henceforth we
assume

$$D''_k \text{ holds for all } k \in V \setminus R. \tag{40}$$

If $D''_k$ fails, we adjust the tree by removing the link $(f(k), k)$ from the tree and identifying its endpoints
$k$ and $f(k)$. In the original tree, we will only be able to identify the characteristics of the composite links
joining $f(k)$ to the children $d(k)$. The procedure is iterated if necessary until (40) holds.

Conditions (38) and (40) enable us to fulfill some assumptions in Theorems 3 and 4. We will henceforth
assume that they hold.

Lemma 1 When (38) and (40) hold, $\hat\gamma_\alpha \in \mathcal{G}$ for any $\alpha \in \mathcal{A}$.


3.5  Example:  the  two-receiver  tree 


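In the simplest case we can establish unimodality of $\mathcal{L}$ directly, and thus conclude convergence of the EM
iterates to the incomplete data MLE. Consider the two receiver tree with root $\rho$ having a single child 1
whose children are the leaf nodes 2 and 3. In the two receiver tree, we enumerate $\Omega = \{11, 01, 10, 00\}$ and
$\Omega^*_0 = \{11, 01, 10, 00, 1u, u1, 0u, u0\}$. It is not difficult to determine that the $\hat\gamma_{k,\alpha}(x^*)$ are as given by the
following table:

$$\begin{array}{c|ccc}
x^* & \hat\gamma_{1,\alpha}(x^*) & \hat\gamma_{2,\alpha}(x^*) & \hat\gamma_{3,\alpha}(x^*) \\ \hline
11 & 1 & 1 & 1 \\
10 & 1 & 1 & 0 \\
01 & 1 & 0 & 1 \\
00 & 0 & 0 & 0 \\
1u & 1 & 1 & \alpha_3 \\
u1 & 1 & \alpha_2 & 1 \\
0u & \dfrac{\alpha_1\bar\alpha_2\alpha_3}{\bar\alpha_1 + \alpha_1\bar\alpha_2} & 0 & \dfrac{\alpha_1\bar\alpha_2\alpha_3}{\bar\alpha_1 + \alpha_1\bar\alpha_2} \\
u0 & \dfrac{\alpha_1\alpha_2\bar\alpha_3}{\bar\alpha_1 + \alpha_1\bar\alpha_3} & \dfrac{\alpha_1\alpha_2\bar\alpha_3}{\bar\alpha_1 + \alpha_1\bar\alpha_3} & 0
\end{array}$$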

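These yield

$$n_0\hat\gamma_{2,\alpha} = m(11) + m(10) + m(1u) + m(u1)\alpha_2 + m(u0)\frac{\alpha_1\alpha_2\bar\alpha_3}{\bar\alpha_1 + \alpha_1\bar\alpha_3}$$

$$n_0\hat\gamma_{3,\alpha} = m(11) + m(01) + m(u1) + m(1u)\alpha_3 + m(0u)\frac{\alpha_1\bar\alpha_2\alpha_3}{\bar\alpha_1 + \alpha_1\bar\alpha_2}$$

$$n_0(\hat\gamma_{2,\alpha} + \hat\gamma_{3,\alpha} - \hat\gamma_{1,\alpha}) = m(11) + m(1u)\alpha_3 + m(u1)\alpha_2$$

The EM iterates $(\mathcal{K}_1(\hat\gamma_\alpha), \mathcal{K}_2(\hat\gamma_\alpha), \mathcal{K}_3(\hat\gamma_\alpha))$ of $(\alpha_1, \alpha_2, \alpha_3)$ are then

$$\mathcal{K}_1(\hat\gamma_\alpha) = \frac{\hat\gamma_{2,\alpha}\hat\gamma_{3,\alpha}}{\hat\gamma_{2,\alpha} + \hat\gamma_{3,\alpha} - \hat\gamma_{1,\alpha}}, \quad \mathcal{K}_2(\hat\gamma_\alpha) = \frac{\hat\gamma_{2,\alpha} + \hat\gamma_{3,\alpha} - \hat\gamma_{1,\alpha}}{\hat\gamma_{3,\alpha}}, \quad \mathcal{K}_3(\hat\gamma_\alpha) = \frac{\hat\gamma_{2,\alpha} + \hat\gamma_{3,\alpha} - \hat\gamma_{1,\alpha}}{\hat\gamma_{2,\alpha}}$$

Theorem 6 In the two-receiver tree, the incomplete data likelihood function $\mathcal{L}$ is unimodal, and hence
$\hat\alpha^{(\ell)}$ converges to the incomplete data MLE provided that $\hat\alpha^{(\ell)} \in \mathcal{A}$ for all $\ell$.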
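
The two-receiver iteration can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the starting point, the fixed iteration count, and the use of exact expected outcome frequencies under independent report loss (instead of simulated counts) are all assumptions made here.

```python
def expected_counts(alpha, p):
    """Expected frequencies of the eight outcomes when each report arrives independently w.p. p."""
    a1, a2, a3 = alpha
    both, left, right = p * p, p * (1 - p), (1 - p) * p
    return {
        "11": both * a1 * a2 * a3,
        "10": both * a1 * a2 * (1 - a3),
        "01": both * a1 * (1 - a2) * a3,
        "00": both * (1 - a1 + a1 * (1 - a2) * (1 - a3)),
        "1u": left * a1 * a2,
        "0u": left * (1 - a1 * a2),
        "u1": right * a1 * a3,
        "u0": right * (1 - a1 * a3),
    }


def em_two_receiver(m, start=(0.5, 0.5, 0.5), iters=500):
    """Iterate the E and M steps for the two-receiver tree using the table of conditional probabilities."""
    a1, a2, a3 = start
    n0 = sum(m.values())
    for _ in range(iters):
        # Conditional probability that the probe reached the unobserved leaf (rows 0u and u0).
        g_0u = a1 * (1 - a2) * a3 / (1 - a1 + a1 * (1 - a2))
        g_u0 = a1 * a2 * (1 - a3) / (1 - a1 + a1 * (1 - a3))
        seen = m["11"] + m["10"] + m["01"] + m["1u"] + m["u1"]
        g1 = (seen + m["0u"] * g_0u + m["u0"] * g_u0) / n0
        g2 = (m["11"] + m["10"] + m["1u"] + m["u1"] * a2 + m["u0"] * g_u0) / n0
        g3 = (m["11"] + m["01"] + m["u1"] + m["1u"] * a3 + m["0u"] * g_0u) / n0
        d = g2 + g3 - g1
        a1, a2, a3 = g2 * g3 / d, d / g3, d / g2   # the map K applied to the current gamma
    return a1, a2, a3


true_alpha = (0.9, 0.8, 0.7)                       # hypothetical link probabilities
estimate = em_two_receiver(expected_counts(true_alpha, p=0.8))
```

Because the counts passed in are exact expected frequencies, the maximizer of the incomplete data likelihood is the generating $\alpha$, so the iterates should settle there; with real probe counts the limit would instead be the sample MLE.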


4  Network  Inference  Algorithm 


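In order to carry out inference on measured data, we express the calculation of $\hat\gamma$ in Theorem 5 as an
algorithm. We start by constructing $b_k$, $c_k$ and $y^*_k$ recursively. Clearly the $b_k$ satisfy

$$b_k = \begin{cases} \bar\alpha_k, & k \in R, \\ \bar\alpha_k + \alpha_k \prod_{j \in d(k)} b_j, & k \in U \setminus R. \end{cases} \tag{41}$$

The $c_k$ satisfy a similar recursion:

$$c_k = \begin{cases} 1, & k \in R,\ x^*_k = u, \\ \bar\alpha_k, & k \in R,\ x^*_k \ne u, \\ \bar\alpha_k + \alpha_k \prod_{j \in d(k)} c_j, & k \in U \setminus R. \end{cases} \tag{42}$$

The $y^*_k$ satisfy the recursion

$$y^*_k = \begin{cases} x^*_k, & k \in R, \\ {\bigvee}^*_{j \in d(k)}\, y^*_j, & k \in V \setminus R. \end{cases} \tag{43}$$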

We formally specify an algorithm for the calculation of the $\hat\gamma_{k,\alpha}(x^*)$ in Figure 1. The main procedure
comprises two phases. In the first phase, set_ybc calculates the $y^*_k$, $b_k$ and $c_k$, passing up the tree from the
leaves. The second phase, set_g, then calculates the $\hat\gamma_{k,\alpha}(x^*)$, traversing the tree from the root $\rho$ downwards.
$h_k$ plays the role of $d(h, k)$ while $e$ plays the role of $\prod_{k \prec i \preceq d(h,k)} \alpha_i \prod_{j \in d(i) \setminus \{d(i,k)\}} c_j$. On a given path down the tree,
flag = 1 until a node $k$ with $y^*_k < 1$ is first encountered; flag = 0 on all calls to set_g below $k$. The
identity of the node $h(i, x^*)$ is then maintained in calls at nodes $i$ below the child $j$ of $k$ (lines 10-13).

We note there is some redundancy in the algorithms, which can be avoided in implementations. $b_k$ and
$c_k$ need not be calculated at nodes $k$ for which $y^*_k = 1$, since these values are not used. The $c_k$ depend
only on the missing data indicator $t(x^*)$, and so need be calculated once for each set of incomplete outcomes
$\{x^* \in \Omega^* : t(x^*) = t\}$ having the same missing data indicator $t$. The $b_k$ do not depend on $x^*$, and so may be
calculated once in advance; in particular $b_k = c_k$ when $x^*$ has no missing data, i.e., when $x^*_k \ne u\ \forall k \in R$.
Lastly, the $y^*_k$ need only be calculated once for each probe with distinct $x^*$, and once at the start of the
sequence of iterations.
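
The two-phase calculation can be sketched in Python as follows. This is a sketch under stated assumptions rather than the report's code: the tree is represented by a hypothetical `children` dictionary, `alpha` maps each node to its link transmission probability with the root assigned 1, missing reports are marked by the string "u", and the wrapper `gamma_hat` is a name introduced here.

```python
U = "u"                       # marker for a missing report (illustrative choice)
RANK = {U: 0, 0: 1, 1: 2}     # the order u < 0 < 1 behind the extended maximum


def set_ybc(children, alpha, x_star, k, y, b, c):
    """Upward pass: fill y[k], b[k], c[k] for the subtree rooted at k, as in (41)-(43)."""
    kids = children.get(k, [])
    if not kids:                                   # k is a receiver
        y[k] = x_star[k]
        b[k] = 1 - alpha[k]
        c[k] = 1.0 if x_star[k] == U else 1 - alpha[k]
        return
    prod_b = prod_c = 1.0
    for j in kids:
        set_ybc(children, alpha, x_star, j, y, b, c)
        prod_b *= b[j]
        prod_c *= c[j]
    y[k] = max((y[j] for j in kids), key=RANK.get)
    b[k] = 1 - alpha[k] + alpha[k] * prod_b
    c[k] = 1 - alpha[k] + alpha[k] * prod_c


def set_g(children, alpha, e, k, h, flag, y, b, c, g):
    """Downward pass: fill g[k], the conditional probability given by Theorem 5."""
    kids = children.get(k, [])
    if y[k] == 1:                                  # probe seen below k, so it reached k
        g[k] = 1.0
        for j in kids:
            set_g(children, alpha, 1.0, j, k, 1, y, b, c, g)
        return
    hk = k if flag == 1 else h                     # highest node on the path with y < 1
    g[k] = e * (c[k] - b[k]) / c[hk]
    for j in kids:
        others = 1.0
        for i in kids:
            if i != j:
                others *= c[i]                     # product of c over the siblings of j
        set_g(children, alpha, e * alpha[k] * others, j, hk, 0, y, b, c, g)


def gamma_hat(children, alpha, x_star, root):
    """Return {k: conditional probability that the probe reached some receiver below k}."""
    y, b, c, g = {}, {}, {}, {}
    set_ybc(children, alpha, x_star, root, y, b, c)
    set_g(children, alpha, 1.0, root, root, 1, y, b, c, g)
    return g
```

On the two-receiver tree (`children = {0: [1], 1: [2, 3]}`) the returned values can be compared against the table in Section 3.5, for example the outcome with leaf 2 reporting 0 and leaf 3 missing.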


5  Identifiability  and  Missing  Data 

We address the question of identifiability, i.e., whether there exists a unique set of model parameters giving
rise to a given distribution of observable data. The multicast inference method exploits correlations between
end-to-end measurements on intersecting paths. Conversely, we expect that if the sets of receivers on which
data from a given probe is observable are insufficiently rich, it will not be possible to infer the loss rates on
all links. We give below a simple example that demonstrates this. In this section we shall derive conditions,
between the topology and the subsets at which data is observable, that must be satisfied in order that the
model parameters can be identified.

1.  procedure main(T, α, x*)
2.      set_ybc(T, α, x*, ρ);
3.      set_g(T, α, 1, ρ, ρ, 1);
4.      return({g_k : k ∈ V});

1.  procedure set_ybc(T, α, x*, k)
2.      if (d(k) == ∅) then {
3.          y*_k := x*_k;
4.          b_k := 1 − α_k;
5.          if (x*_k == u) then { c_k := 1; }
6.          else { c_k := 1 − α_k; }
7.      }
8.      else {
9.          foreach (j ∈ d(k)) {
10.             (y*_j, b_j, c_j) := set_ybc(T, α, x*, j);
11.         }
12.         y*_k := ∨*_{j ∈ d(k)} y*_j;
13.         b_k := (1 − α_k) + α_k ∏_{j ∈ d(k)} b_j;
14.         c_k := (1 − α_k) + α_k ∏_{j ∈ d(k)} c_j;
15.     }
16.     return (y*_k, b_k, c_k);

1.  procedure set_g(T, α, e, k, h, flag)
2.
3.      if (y*_k == 1) then {
4.          g_k := 1;
5.          foreach (j ∈ d(k)) {
6.              set_g(T, α, 1, j, ρ, 1);
7.          }
8.      }
9.
10.     else {
11.         if (flag == 1) then { h_k := k; }
12.         else { h_k := h; }
13.
14.         g_k := e (c_k − b_k) / c_{h_k};
15.         foreach (j ∈ d(k)) {
16.             set_g(T, α, e α_k ∏_{i ∈ d(k)\{j}} c_i, j, h_k, 0);
17.         }
18.     }

Figure 1: Algorithms for determining $\hat\gamma_{k,\alpha}(x^*)$, as returned from the procedure main.

Consider a parameterized family of distributions $\{P_\phi : \phi \in \Phi\}$ with vector parameter $\phi$, and let $F$ be
some function on $\Phi$. We say that $P_\phi$ identifies $F(\phi)$ when $P_\phi = P_{\phi'}$ implies $F(\phi) = F(\phi')$. Here, $F$ will
be the identity, or some other projection of components of $\phi$. In an MAR model, $P_{\alpha,\theta}$ identifies $(\alpha, \theta)$ iff

$$q_{\alpha,\theta}(x^*) = q_{\alpha',\theta'}(x^*)\ \ \forall x^* \in \Omega^* \implies (\alpha, \theta) = (\alpha', \theta'). \tag{44}$$

A simple example of MCAR data that is not identifiable is a two leaf tree in which, for each probe
independently, data is missing from exactly one leaf. Then the only non-trivial equations (17) become

$$q_{\alpha,\theta}(1u) = \alpha_1\alpha_2\,\theta(1u), \qquad q_{\alpha,\theta}(0u) = (1 - \alpha_1\alpha_2)\,\theta(0u), \tag{45}$$

$$q_{\alpha,\theta}(u1) = \alpha_1\alpha_3\,\theta(u1), \qquad q_{\alpha,\theta}(u0) = (1 - \alpha_1\alpha_3)\,\theta(u0).$$

The right-hand sides of these equations are invariant w.r.t. the transformations $\alpha_1 \mapsto k\alpha_1$, $\alpha_2 \mapsto \alpha_2/k$, $\alpha_3 \mapsto \alpha_3/k$.

With each $S \subset R$ we associate the minimal logical multicast tree $T_S = (V_S, L_S)$ that spans the root $\rho$
and $S$. This is obtained by first finding the minimum spanning tree of $\rho$ and $S$ in $T$. The branch points in
the spanning tree, together with $\rho$ and $S$, form the node set $V_S$. To define $L_S$, the parent $f_S(k)$ in $T_S$ of
each node $k$ in $U_S := V_S \setminus \{\rho\}$ is the $\succ$-minimal $j$ in $V_S$ such that $j \succ k$ in $T$. A path in $T$ that connects
two nodes in $V_S$ is called an $S$-segment. $K_S(i) = \{j \in V : i \preceq j \prec f_S(i)\}$ is the $S$-segment terminating
at $i \in U_S$. Given $i \in V$, let $\kappa_S(i)$ denote the node in $V_S$ that terminates the $S$-segment containing $i$, i.e., that
for which $i \in K_S(\kappa_S(i))$. Likewise, $\alpha_S(i) = \prod_{j \in K_S(i)} \alpha_j$ denotes the composite transmission probability
along the segment $K_S(i)$. $\alpha_S = \{\alpha_S(i) : i \in U_S\}$ will denote the collection of such probabilities.

Let $D_S$ be the $\#U_S \times \#U$ incidence matrix of the nodes of $U$ in the segments of $T_S$, i.e., $(D_S)_{jk} = 1$ if
$k \in K_S(j)$ and 0 otherwise. Setting $\beta_S(k) = \log \alpha_S(k)$ and $x_k = \log \alpha_k$ we have that

$$\beta_S = D_S x \tag{46}$$

for any $S \subset R$. Before stating and proving results on identifiability, we note that there exists at least one
solution, $\log \alpha$, to (46). Let $P_{\alpha,\theta,S}$ denote the distribution of the reports from nodes in $S$. We give two
conditions for identifiability of $\alpha$.

Theorem 7 Let $T$ be a canonical loss tree and $\mathcal{S} = \{S_i\}$ a collection of subsets of $R$.

(i) $U = \bigcup_i U_{S_i}$ if and only if the equations $\{\beta_{S_i} = D_{S_i} x\}_i$ have a unique solution $x$.

(ii) Assume $P_{\alpha,\theta,S_i}$ identifies $\alpha_{S_i}$ for each $i$. Then $\{P_{\alpha,\theta,S_i}\}_i$ identifies $\alpha$ iff either (and hence both) of
the conditions of part (i) are satisfied.

Remarks. Uniqueness of the solution to (46) is determined by the structure of the $D_S$, which depend only
on the topology and the choice of the $S$, not on $\beta_S$. Consequently, when uniqueness holds, it does so for
any additive metric. Thus one can devise a test for identifiability based on path length in terms of number of
links. Furthermore, if $\alpha$ is not identifiable, the procedure can be modified to determine which links can be
solved.
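
The coverage condition in part (i) of Theorem 7 lends itself to a simple check. The sketch below assumes the tree is given as a hypothetical `parent` dictionary mapping every non-root node to its parent; the function names are introduced here for illustration. It computes $U_S$ as the receivers in $S$ together with the non-root branch points of the tree spanning the root and $S$, then tests whether the union over the collection equals $U$.

```python
def segment_endpoints(parent, S):
    """U_S: receivers in S plus the non-root branch points of the tree spanning the root and S."""
    visited = set()
    fanout = {}                       # node -> number of its children on the spanning tree
    for r in S:
        k = r
        while k in parent and k not in visited:
            visited.add(k)
            fanout[parent[k]] = fanout.get(parent[k], 0) + 1
            k = parent[k]
    branch_points = {k for k, n in fanout.items() if n >= 2 and k in parent}
    return set(S) | branch_points


def covers_all_links(parent, subsets):
    """Coverage condition of Theorem 7(i): U equals the union of the U_S over the collection."""
    covered = set()
    for S in subsets:
        covered |= segment_endpoints(parent, S)
    return covered == set(parent)
```

For the two leaf tree `parent = {1: 0, 2: 1, 3: 1}`, the collection `[{2}, {3}]` fails the test, in line with the non-identifiable example above, whereas `[{2, 3}]` passes.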

We say that complete data is available from a subset $S$ if $\theta(x^*) > 0$ for all $x^*$ such that $R(\rho, x^*) = S$,
i.e., for which reports are received from all receivers in $S$ and no others. Let $\mathcal{S}$ denote the set of subsets
$S$ of $R$ for which complete data is available, and $\Theta_c$ denote the set of missingness parameters $\theta$ for which
$U = \bigcup_{S \in \mathcal{S}} U_S$.

Theorem 8 Restrict the parameter space to $\mathcal{A} \times \Theta_c$ and assume data is MCAR. Then $P_{\alpha,\theta}$ identifies $(\alpha, \theta)$.

Although we do not have a corresponding result for general MAR models, Theorem 8 is sufficient to
enable further analysis of simple models in the following sections.


6  Asymptotics  for  Large  Numbers  of  Probes 

Let $\hat\alpha = \arg\max_\alpha \mathcal{L}(\alpha)$ denote the incomplete data MLE arising from (20). In this section we examine the
asymptotic properties of $\hat\alpha$ as the number of probes $n$ grows, without specific reference to the EM algorithm.

Theorem 9 Assume data is MCAR. The incomplete data MLE $\hat\alpha$ is consistent, i.e., $\lim_{n \to \infty} \hat\alpha = \alpha$ almost
surely.

We now describe the asymptotic variance of $\hat\alpha$ for large numbers of probes $n$ in the regime of small
loss probabilities $\bar\alpha$. We calculate the expected Fisher information matrix for the incomplete data problem,
i.e., the matrix $\mathcal{I}(\alpha, \theta) = [\mathcal{I}_{ij}(\alpha, \theta)]_{i,j \in U}$ where $\mathcal{I}_{ij} = -E_{\alpha,\theta}\, \partial^2 \mathcal{L} / \partial\alpha_i \partial\alpha_j$. Under conditions that we establish
below, the inverse of $\mathcal{I}(\alpha)$, suitably rescaled, is the asymptotic variance of $\hat\alpha$.

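Our approach is to decompose the Fisher information matrix as a sum over subtrees for which complete
data is present at the leaves. In the original incomplete data problem for the logical multicast topology $T$,
the counts $n_S = \{m(x^*) \mid R(\rho, x^*) = S\}$, for each $S \subset R$, can be considered as a set of counts of complete
outcomes on $T_S$ stemming from those probes for which reports were received only from nodes in $S$. Thus
the incomplete data log-likelihood function can be decomposed as follows:

$$\mathcal{L}(\alpha) = \sum_{S \subset R:\, S \ne \emptyset} \mathcal{L}_c(T_S, n_S, \alpha_S), \quad\text{where}\quad \mathcal{L}_c(T_S, n_S, \alpha_S) = \sum_{x^*:\, R(\rho, x^*) = S} m(x^*) \log p_{\alpha_S}(x^*_S) \tag{47}$$

and $x^*_S = \{x^*_k : k \in S\}$ is the data in $x^*$ that is observable at $S$. The corresponding decomposition of the
expected Fisher information matrix is

$$\mathcal{I}_{ij}(\alpha) = \sum_{S \subset R:\, S \ne \emptyset}\ \sum_{k,\ell \in U_S} \mathcal{I}^S_{k\ell}(\alpha_S)\, \frac{\partial \alpha_S(k)}{\partial \alpha_i}\, \frac{\partial \alpha_S(\ell)}{\partial \alpha_j} \tag{48}$$

where $\mathcal{I}^S_{k\ell}(\alpha_S) = -E\, \partial^2 \mathcal{L}_c(T_S, n_S, \alpha_S) / \partial\alpha_S(k) \partial\alpha_S(\ell)$. Let $N_S = nP[R(\rho, X^{(i)}(T^{(i)})) = S]$ be the mean number of probes
with data observable exactly at $S$, and $\bar\alpha_S(i) = \sum_{j \in K_S(\kappa_S(i))} \bar\alpha_j$ the sum of link loss rates on the $S$-segment
containing $i$.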


Theorem 10 (i) When $\theta \in \Theta_c$ and hence when $\hat\alpha$ is consistent, $\sqrt{n}(\hat\alpha - \alpha)$ converges in distribution to
a mean zero multivariate Gaussian random variable with covariance matrix $n\,\mathcal{I}^{-1}(\alpha)$.

(ii) When data is MCAR, $\mathcal{I}_{ij} = \sum_{S \subset R:\, S \ne \emptyset} \dfrac{N_S}{\bar\alpha_S(i)}\, \delta_{\kappa_S(i)\kappa_S(j)} + O(1)$.

Example: uniform report transmission. Let reports be transmitted independently with uniform probability
$p \in (0, 1]$. Then $N_S = n p^{\#S} (1 - p)^{\#R - \#S}$. For each $S \subset R$, and node $i \in U_S$, let $C_S(i)$ denote the
matrix on $U$ with entries $C_S(i)_{jk} = 1/\bar\alpha_S(i)$ if $j, k \in K_S(i)$ and 0 otherwise. For $s \in \{1, \dots, \#R\}$ let
$C_s = \sum_{S:\, \#S = s} \sum_{i \in U_S} C_S(i)$. Then

$$\mathcal{I} = nC \cdot (1 + O(\bar\alpha)), \quad\text{where}\quad C = \sum_{s=1}^{\#R} p^s (1 - p)^{\#R - s}\, C_s. \tag{49}$$

Let $P_K$ denote the orthogonal projection onto the nullspace of a symmetric matrix $K$, and recursively define
matrices $K_1, \dots, K_{\#R}$ by $K_1 = C_1$, and

$$K_s = P_{K_1 + \dots + K_{s-1}}\, C_s\, P_{K_1 + \dots + K_{s-1}}. \tag{50}$$

Let $r_0$ denote the minimal $s$ for which $P_{K_1 + \dots + K_s} = 1$. Since $P_{K_1 + \dots + K_{\#R}} = 1$, such an $r_0 \le \#R$ exists.

Proposition 1 $p^{r_0} C^{-1}$ converges to the pseudo-inverse of $K_{r_0}$ as $p \to 0$.

Let $\mathcal{I}_2$ denote the Fisher information arising from measurements on binary subsets, i.e., $\mathcal{I}_2$ is the sum
obtained by restricting (48) to binary subsets $S$.

Proposition 2 (i) $\mathcal{I} \ge \mathcal{I}_2 > 0$, and hence $0 < \mathcal{I}^{-1} \le \mathcal{I}_2^{-1}$ in the order of positive linear operators.

(ii) Proposition 1 holds with $r_0 = 2$.

Thus we have established:

Proposition 3 Assume independent report loss with uniform probability $p$. Then $\sqrt{n}(\hat\alpha - \alpha)$ converges to a
multivariate Gaussian random variable with mean zero and covariance $G(\alpha, p)$, where $\lim_{p \to 0} p^2 G(\alpha, p) =
K_2^+ + O(\|\bar\alpha\|^2)$ as $\bar\alpha \to 0$, where $K_2^+$ is the pseudo-inverse of $K_2$.

Remark: Proposition 3 suggests that we approximate the variance of $\hat\alpha_k$ by $(K_2^+)_{kk} / (n p^2)$ when $p$ and
$\bar\alpha$ are small, and $n$ is large.


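Example: uniform report transmission from binary trees. Consider the family of binary trees with $2^r$
leaves, $r = 1, 2, \dots$, with small uniform link loss probabilities $\bar\alpha$ and uniform small report transmission
rate $p$. Let $v(r) = (v_1(r), \dots, v_{r+1}(r))$ denote the set of unique diagonal elements of $K_2^+/\bar\alpha$, the $j$-th
element determining the asymptotic variance on links $j$ nodes away from the root. Using Mathematica [16]
to perform the algebra, we found the first six $v(r)$ to be:

$$\begin{aligned}
v(1) &= \{\tfrac{1}{3}, \tfrac{1}{3}\}, \\
v(2) &= \{\tfrac{1}{8}, \tfrac{21}{\ldots}, \tfrac{2}{5}\}, \\
v(3) &= \{\tfrac{3}{80}, \tfrac{49}{240}, \tfrac{25}{42}, \tfrac{3}{7}\}, \\
v(4) &= \{\tfrac{1}{96}, \tfrac{\ldots}{672}, \tfrac{\ldots}{112}, \tfrac{\ldots}{144}, \tfrac{4}{9}\}, \\
v(5) &= \{\tfrac{5}{1792}, \tfrac{\ldots}{1792}, \tfrac{\ldots}{64}, \tfrac{\ldots}{80}, \tfrac{\ldots}{55}, \tfrac{5}{11}\}, \\
v(6) &= \{\tfrac{3}{4096}, \tfrac{187}{36864}, \tfrac{133}{5760}, \tfrac{153}{1760}, \tfrac{73}{264}, \tfrac{209}{312}, \tfrac{6}{13}\}.
\end{aligned} \tag{51}$$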


In all cases the estimator variance rises in a given tree on moving away from the root, except falling slightly
at a leaf link as compared with its parent. At a given distance from the root, the link variance decreases as
the tree depth increases. Both these trends can be understood by the intuition that variance should decrease
when data is available from larger subtrees below a given link of interest. Considering the root and leaf links
only, the values in (51) are consistent with the forms

$$v_1(r) = \frac{r}{r + 2}\, \frac{1}{2^{2(r-1)}}, \qquad v_{r+1}(r) = \frac{r}{2r + 1}. \tag{52}$$

Figure 2: 4-receiver binary tree, used in the model-based simulation of Section 8.1.


7  Convergence  Rates  for  the  EM  iterates 

We now consider convergence of the EM iterates themselves. Let $\mathcal{M}$ denote the map on $\mathcal{A}$ that implements
the iteration, i.e., such that $\hat\alpha^{(\ell+1)} = \mathcal{M}(\hat\alpha^{(\ell)})$. A Taylor expansion of the iterative map gives

$$\hat\alpha^{(\ell+1)} - \hat\alpha \approx \nabla\mathcal{M} \cdot (\hat\alpha^{(\ell)} - \hat\alpha) \tag{53}$$

where $\nabla\mathcal{M}_{ij} = \partial\mathcal{M}_i / \partial\alpha_j$ is the gradient of $\mathcal{M}$. A standard result [17, §3.9.3] expresses $\nabla\mathcal{M} = 1 - \mathcal{I}_c^{-1}\mathcal{I}$,
with $\mathcal{I}_c$ the complete data information matrix and $\mathcal{I}$ the incomplete data information matrix from (48). The
convergence ratio of the iteration is taken as the maximum eigenvalue $\lambda$ of $\nabla\mathcal{M}$.

Our analysis of the convergence ratio is confined to the regime treated in Section 6, namely that of
independent report transmission with small probability $p$, and small link loss probabilities $\bar\alpha$. In this regime,
we have seen that $(\mathcal{I}_c^{-1})_{ij} = n^{-1}(\bar\alpha_i \delta_{ij} + O(\|\bar\alpha\|^2))$ and so from (49) $\nabla\mathcal{M}(\alpha)_{ij} = \delta_{ij} - \bar\alpha_i C_{ij} + O(\|\bar\alpha\|^2)$.

Let $\mathbf{S}(X)$ denote the set of eigenvalues of a matrix $X$.

Proposition 4 Assume independent report loss with probability $p \in (0, 1)$ and small uniform probe loss
rate $\bar\alpha$. The convergence rate $\lambda$ for the EM algorithm obeys $\lambda = 1 - p^2\kappa + O(p^2(p + \bar\alpha))$, where $\kappa$ is the
minimum non-zero eigenvalue of $\bar\alpha K_2$.


8  Experiments  and  Simulations 

In order to evaluate the performance of the missing data inference algorithm, we conducted two types of simulation. First, we used model-based simulation in which the model for missing data indicators conformed with the MCAR property. Second, we used a network-based implementation of the RTCP-based reporting mechanism outlined in Section 2.5. In this case the missing data indicators are not known to conform to the MAR model. This enabled us to test robustness of the algorithm with respect to violations of the MAR hypothesis that might occur in a real network application.

8.1  Model-based  simulation 

We conducted model-based simulations on a balanced binary tree with 4 receivers, illustrated in Figure 2. Probe losses were independent with a uniform loss rate per link. Receiver reports were generated at each receiver for each probe and were transmitted independently with uniform probability $p$. We conducted 100 separate simulation runs, each of 100,000 probes. Initialization of the EM iterates used (29). The termination criterion
for  the  EM  algorithm  was  that  successive  iterates  should  have  an  absolute  difference  of  less  than  ICT^ 
on  each  link  k. 
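The data-generation step of such a model-based simulation can be sketched as follows. This is only an illustration under stated assumptions: the link numbering (link 1 from the root, links 2 and 3 below it, leaf links 4 to 7) is inferred from the text, and the function names `simulate` and `count_complete` are hypothetical. The EM estimation step itself is not shown.

```python
import random

# Assumed numbering of the Figure 2 tree: link 1 from the root, links 2-3 below
# it, leaf links 4-7 (children of 2 are 4 and 5; children of 3 are 6 and 7).
PARENT = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
LEAVES = (4, 5, 6, 7)

def simulate(n_probes, loss_rate, report_prob, seed=0):
    """Return one tuple per probe; entries are 1, 0 or None (report missing)."""
    rng = random.Random(seed)
    observations = []
    for _ in range(n_probes):
        reached = {0: True}  # node 0 is the source
        for k in sorted(PARENT):  # parents have smaller labels than children
            reached[k] = reached[PARENT[k]] and rng.random() >= loss_rate
        # Each receiver's report is kept independently with probability report_prob.
        obs = tuple(
            int(reached[k]) if rng.random() < report_prob else None
            for k in LEAVES
        )
        observations.append(obs)
    return observations

def count_complete(observations):
    """Number of probes for which every receiver's report arrived."""
    return sum(1 for obs in observations if None not in obs)

data = simulate(n_probes=100_000, loss_rate=0.01, report_prob=0.5)
print(count_complete(data) / len(data))  # compare with report_prob ** 4
```

Because report loss here is independent of the probe outcome, the missingness in this sketch satisfies the MCAR property assumed for the model-based experiments.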

Figure 3(left) shows the mean and error bars for 95% confidence of link loss rate estimates obtained using the missing data algorithm. We also display the corresponding quantities for the complete data estimator applied to only those probes for which complete reports were available. In both cases the mean estimate is close to the model loss rate, i.e. $\bar\alpha_k = 0.01$. But note the rapid widening of error bars for the complete data estimator, compared with the missing data algorithm, as the report transmission probability decreases. From Prop. 3 we expect the standard error of the link loss rate estimates to diverge as $p^{-1}$ for the missing data algorithm, regardless of the topology. However, in a 4-leaf tree the number of probes with complete data is proportional to $p^4$. Hence we expect the standard error to diverge as $p^{-2}$, with faster divergence for trees with more receivers. In this example, for $p$ less than 0.4, the error bars encompass loss rate 0: the inferred loss from complete data becomes statistically indistinguishable from 0.
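The two divergence rates can be compared numerically. The sketch below tabulates only the leading-order growth factors in p, ignoring all constants; the function name and the choice of p values are illustrative assumptions.

```python
def standard_error_scaling(p, n_receivers=4):
    """Leading-order growth factors of the standard error as p shrinks."""
    missing_data = p ** -1.0                    # rate quoted from Prop. 3
    complete_data = p ** (-n_receivers / 2.0)   # usable probes scale as p**n_receivers
    return missing_data, complete_data

for p in (0.9, 0.7, 0.5, 0.3, 0.1):
    md, cd = standard_error_scaling(p)
    print(f"p={p:.1f}  missing-data ~ {md:6.2f}  complete-data ~ {cd:7.2f}")
```

The complete-data factor follows from the standard error scaling as the inverse square root of the number of usable probes, which is why it worsens as the number of receivers grows.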

Figure 3(right) breaks down the standard error of the link loss rate estimates according to the location of the link in the topology, with links 1, 2 and 4 being representative of links respectively 0, 1 and 2 links removed from the root. The experimental standard errors show close agreement with the theoretical values obtained by inverting the information matrix $\mathcal{I}$ in (49). We also show the small $p$ approximation obtained using Proposition 3 and the values $v(2)$ from (51). The approximation remains reasonable even for quite large $p$.

8.2  RTCP-based  experiments 

The RTCP-based simulations used data gathered from a network-based implementation of loss reporting. Loss reports are embedded in RTCP feedback packets; any collector listening to these can then perform
inference.  The  basic  RTCP  reporting  mechanism  includes  only  the  average  loss  rate  based  on  sequence 
numbers  of  received  packets.  An  extension  of  the  report  format  allows  the  inclusion  of  a  binary  vector 
indicating  receipt  or  otherwise  of  a  set  of  packets. 

According  to  the  RTP  standard  [20],  the  total  report  volume  over  all  receivers  should  not  exceed  5%  of 
the  source  rate.  RTCP  clients  estimate  their  share  of  this  based  upon  the  reports  they  hear  from  the  other 
receivers,  and  limit  report  frequency  and  size  accordingly.  Consequently,  for  a  sufficiently  large  number  of 
receivers,  it  will  not  be  possible  to  report  on  all  probes.  Missingness  arises  then  by  two  mechanisms:  the 
omission of certain probes from reporting, and the loss of report packets during transmission to the collector.

Figure 3: VARIANCE OF MISSING DATA ESTIMATOR IN 4-RECEIVER UNIFORM PROBE AND PACKET LOSS MODEL: Over 100 simulation runs each of 100,000 probes, uniform link loss $\bar\alpha = 0.01$, report transmission probability $p$ from 0.1 to 0.9. Left: mean estimate with error bars for 95% confidence. Comparison with estimator using only probes with complete data. Right: standard error depending on link location: experiment, theory and approximation.

The implementation of extended RTCP-based reporting used in this study has a simulation mode that enables it to report on packet losses generated on a model topology according to a Bernoulli loss model, rather than due to packet loss in a real network. The probe source was chosen to have the characteristics of a GSM audio stream that could act as a probe source in real networks, sending packets at a rate of 50 per second. Since probe losses follow the assumed statistical model, only the missing data indicators can potentially exhibit departures from our model assumptions. Report thinning and transmission then take place in the manner described above.

We collected traces from a 32 receiver balanced binary tree for which the link loss rates were chosen independently with a uniform distribution between 1% and 10%. The trace comprised reports on 11,956 probe packets, encompassing about 4 minutes at 50 packets per second. The mean number of reports received for a given probe was 18.8, so that the proportion of missing reports was 1 - 18.8/32 = 0.413. The maximum number of reports per probe was 29, i.e. no probe had complete data. Figure 4(left) shows a scatter plot of the 63 pairs of (actual, inferred) loss rates. The agreement is quite close, with tight clustering around the line of slope 1 through the origin. The median relative error over all links was only 4.5%.

Figure 4(right) displays the median, 5th and 95th percentile of the relative error over all links as a function of the size of a subset of probes used for inference. Note that even with 2000 probes the relative error is typically less than 50%. Hence we can expect to identify the lossiest links with measurements over a duration less than 1 minute. The median error is only about 13% for this number of probes.

Figure 4: INFERENCE FROM RTCP TRACE DATA ON 32 RECEIVER BINARY TREE. Left: Scatter plot of inferred vs. actual loss rates for a full trace of 11,956 probes. Right: Median, 5th and 95th percentiles of relative error over all links as function of number of probes.
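The arithmetic behind the figures quoted for the RTCP trace can be reproduced with a few lines. The helper names below are hypothetical, and the loss rates passed to the error metric are made-up values used only to exercise it, not data from the trace.

```python
import statistics

def trace_summary(n_probes, rate_pps, mean_reports, n_receivers):
    """Duration in seconds and the fraction of receiver reports missing."""
    duration_s = n_probes / rate_pps
    missing_fraction = 1.0 - mean_reports / n_receivers
    return duration_s, missing_fraction

def median_relative_error(actual, inferred):
    """Median over links of |inferred - actual| / actual."""
    return statistics.median(abs(i - a) / a for a, i in zip(actual, inferred))

print(trace_summary(11_956, 50, 18.8, 32))   # full trace
print(trace_summary(2_000, 50, 18.8, 32))    # 2000-probe subset
print(2 * 32 - 1)                            # links in a 32-receiver binary tree

# Made-up loss rates, purely to exercise the error metric.
print(median_relative_error([0.02, 0.05, 0.10], [0.021, 0.045, 0.10]))
```

At 50 packets per second, 2000 probes correspond to 40 seconds of measurement, which is the basis for the statement that the lossiest links can be identified in under a minute.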


9  Conclusions 

In  this  paper  we  have  extended  the  multicast  based  method  for  inferring  network  internal  loss  from  end-to- 
end  measurements  that  was  first  proposed  in  [2].  The  original  method  assumed  the  presence  of  complete  data 
specifying  the  set  of  end-points  reached  by  each  multicast  probe.  However,  the  proposed  use  of  the  RTCP 
transport  protocol  to  transmit  measurements  inevitably  leads  to  missing  data,  either  through  the  need  to  thin 
data,  or  due  to  loss  of  reports  in  transmission.  This  motivated  extending  the  former  approach  to  work  with 
missing  data.  An  ad  hoc  approach  of  working  with  subsets  of  complete  data  would  have  several  drawbacks: 
inference  on  all  links  may  not  be  possible;  inference  would  be  inconsistent  under  the  types  of  correlation 
between  probe  data  and  missingness  that  could  reasonably  occur  in  this  context  (see  Section  2.5);  and  the 
estimators  are  not  generally  efficient.  These  considerations  motivated  the  use  of  a  more  generally  applicable 
scheme  that  accommodates  the  missing  data  directly,  under  more  general  conditions  on  the  missing  data 
mechanism. 

This paper extended the Maximum-Likelihood approach of [2] to encompass missing data. We applied the EM algorithm to generate an iterative approximation to the corresponding MLE. We analyzed convergence rates for the EM algorithm itself, and for the MLE as the number of probes grows, and showed how to calculate these rates explicitly for a particular class of models. We tested an implementation of the algorithm in model-based simulations with known missingness statistics, and also in traces gathered from an implementation of the RTCP-based report transfer method. These results showed (i) the reduction in estimator variance, as compared with the ad hoc approach, where applicable; (ii) accuracy of inferred loss rates compared with model or directly measured rates in the simulation; and (iii) robustness of the approach under potential departures from the model assumptions on the missingness statistics in the RTCP-based application. In the RTCP-based experiments the median estimation error on a 32 receiver tree was only about 13% for 2,000 probes, and was typically less than about 50%. Thus inference sufficiently accurate to identify the lossiest links could be performed on measurements collected over about a minute.

Future work is planned in two directions. First, we want to apply the same general methodology to the estimation of other internal characteristics, such as delay, utilization and topology itself, by adapting the framework of the present paper to work on estimation of these quantities with complete data, as performed in [3, 9, 15]. A second direction is to develop more specific models of the missing data mechanism that could be used in a parametric approach to estimation with missing data. Lastly, we intend to publish elsewhere details of the RTCP reporting mechanism that motivated this study.

Acknowledgment  We  thank  Francesco  Lo  Presti  for  some  useful  suggestions. 


10  Proofs  of  Theorems 


Proof of Theorem 2: We first render $\mathcal{L}_c(\alpha)$ into the canonical form of a standard exponential family.

• Denote by 0 and 1 the elements $x$ of $\Omega$ with all 0 or all 1 respectively.

• For $x \in \Omega$, denote by $W'(x)$ those nodes $k \in U$ for which $x_j = 0$ for all $j \in R(k)$. Let $W(x)$ be the $\prec$-maximal elements of $W'(x)$. Note that $W(1) = \emptyset$.

• For each $k \in U$ and $i \in \{0, 1\}$ define $q_k(i) = P_\alpha[X_j = i\ \forall j \in R(k) \mid X_{f(k)} = 1]$.

• Define new parameters $\{\delta_k : k \in U\}$ by $\delta_k = \log(q_k(0)/q_k(1))$.

• Observe that

\[ \frac{p_\alpha(x)}{p_\alpha(1)} = \prod_{k \in W(x)} \frac{q_k(0)}{q_k(1)} = \prod_{k \in W(x)} e^{\delta_k}. \tag{54} \]

We interpret the product over $\emptyset$ for $x = 1$ as 1.

• The map taking $\mathcal{A}$ to its image $\Delta$ under the change of parameters $\alpha \mapsto \delta$ is invertible. To see this note that given $\sum_{x \in \Omega} p_\alpha(x) = 1$, (54) fixes the $p_\alpha(x)$ in terms of the $\delta$. These in turn determine the $\gamma_k$ and hence the $\alpha_k$, by Theorem 2.

• Writing $n(1) = n - \sum_{x \neq 1} n(x)$ and recalling that $W(1) = \emptyset$, we find

\[ \mathcal{L}_c(\alpha) = \sum_{x \in \Omega} n(x) \sum_{k \in W(x)} \delta_k + n \log p_\alpha(1) = \sum_{k \in U} N_k \delta_k - n c(\delta), \]

where $N_k = \sum_{x \in \Omega : k \in W(x)} n(x)$, and $c(\delta)$ is the reparameterization of $-\log p_\alpha(1)$ in terms of $\delta$:

\[ c(\delta) = \log \sum_{x \in \Omega} \prod_{k \in W(x)} e^{\delta_k}. \tag{55} \]

The expression (55) has the form of a standard exponential family, with the log-likelihood expressed in terms of the natural parameters $\delta = \{\delta_k : k \in U\}$, sufficient statistics $N = \{N_k : k \in U\}$, and cumulant $c(\delta)$. Since $c(\delta)$ is finite for all $\delta \in \mathbb{R}^U$ the family can be considered full. However, the parameter space of interest is the open subset $\Delta \subset \mathbb{R}^U$ that is the image of $\mathcal{A}$ under the reparameterization $\alpha \mapsto \delta$.
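The reparameterization can be checked numerically on the smallest non-trivial case. The sketch below assumes a two-leaf tree (link 1 from the source, leaf links 2 and 3) with hypothetical pass probabilities, and verifies the product form (54) and the cumulant (55); all names are illustrative.

```python
import math
from itertools import product

# Hypothetical pass probabilities: link 1 from the source, leaf links 2 and 3.
ALPHA = {1: 0.9, 2: 0.8, 3: 0.7}

def p(x):
    """Probability of leaf outcome x = (x2, x3) on the two-leaf tree."""
    a1, a2, a3 = ALPHA[1], ALPHA[2], ALPHA[3]
    below = (a2 if x[0] else 1 - a2) * (a3 if x[1] else 1 - a3)
    return a1 * below + ((1 - a1) if x == (0, 0) else 0.0)

# q_k(i): probability that all leaves below k equal i, given the parent of k
# was reached (the parent of node 1 is the source, which is always reached).
q0 = {1: p((0, 0)), 2: 1 - ALPHA[2], 3: 1 - ALPHA[3]}
q1 = {1: p((1, 1)), 2: ALPHA[2], 3: ALPHA[3]}
delta = {k: math.log(q0[k] / q1[k]) for k in ALPHA}

def W(x):
    """Maximal nodes whose descendant leaves all have outcome 0."""
    if x == (0, 0):
        return [1]
    return [k for k, xk in zip((2, 3), x) if xk == 0]

def ratio_from_delta(x):
    """Right-hand side of (54): product of exp(delta_k) over k in W(x)."""
    return math.exp(sum(delta[k] for k in W(x)))

outcomes = list(product((0, 1), repeat=2))
cumulant = math.log(sum(ratio_from_delta(x) for x in outcomes))
for x in outcomes:
    print(x, p(x) / p((1, 1)), ratio_from_delta(x))
print(cumulant, -math.log(p((1, 1))))
```

The two printed columns agree for every outcome, and the cumulant equals the negative log-probability of the all-ones outcome, as the normalization argument requires.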

Since the mapping $\mathcal{A} \to \Delta : \alpha \mapsto \delta$ is invertible, the parameters $\delta$ are identifiable by Theorem 2(v), and hence the exponential family is affinely independent. (A simple argument shows that natural parameters in an open set are identifiable iff the exponential family is affinely independent.) A well-known result (see e.g. [13, Ex. 6.6.3]) for standard exponential families then says that the MLE is the solution $\hat\delta$ of

\[ N_k = E_{\hat\delta}[N_k], \quad k \in U \tag{56} \]

provided  this  5  lies  in  the  interior  of  A.  But  clearly  'Yhjyk  ~  ~  TA:)?  hence  finding  the  solution 

to  (56)  is  equivalent  to  finding  the  solution  d  to 

lk  =  ^a'[%l  keU.  (57) 

Provided  7  G  then  by  Theorem  1,  a  is  the  unique  such  solution,  and  hence  if  it  lies  in  A  it  is  the  MEE.| 

Proof  of  Theorem  3:  Observe  that  Eq,  [£c  (aO  I  logPa'(a;).  Hence  maximizing 

Qgd)  Q,/  over  c/  is  equivalent  to  finding  the  complete-data  MEE,  but  with  n{x)  replaced  by  [n{x)\m] 
throughout.  In  particular,  %,  being  a  linear  combination  of  the  n{x),  gets  replaced  by  Now  if  G  A 
then  it  is  not  hard  to  see  that  (x*)  G  Q  for  each  x*,  and  hence  7^^^  G  Q  since  Q  is  convex.  The  result 
then  follows  from  Theorems  1  and  2.  ■ 

Proof  of  Theorem  4:  The  condition  says  that  the  sequence  of  EM  iterates  exists  in  A.  Parts  (i)  and 

(ii)  follow  from  Theorem  6  of  [22].  This  is  because:  (a)  by  Theorem  l(iii),  is  stationary  for  a  i-)- 

Q{a,  (b)  VaQ{oi' ,  a)  is  clearly  continuous  for  (c/,  a)  ^  A 'x  A',  and  (c)  By  Theorem  2,  Cc  comes 
from  a  regular  exponential  family  and  hence  ||  — )■  0  as  .(  — )■  00;  see  remark  3(vi)  in  [22].  Part 

(iii)  then  follows  from  Corollary  1  in  [22]  g 


Proof  of  Theorem  5:  Eor  the  proof  we  observe  the  following  Markov  property  of  the  probe  process  X : 


(M)  Conditioned  on  Aj  =  1,  the  distributions  of  the  sets  of  variables  {X^  :  k  G  R{i)}  are  independent 
for  different  i  E  d{k). 


If  y*p.  =  1,  then  ^ j^R(k)Xj  =  1,  and  hence  =  1-  Suppose  instead  that  <  1.  Eet  Q{k)  denote 

the  event  that  Xj  =  0  for  all  j  G  R{k,  x*)  if  R{k,  x*)  A  otherwise  take  Q{k)  as  the  universal  set  in  the 
underling  probability  space.  Since  y^  =  1  for  h  =  h{k,x*),  then  {A*  =  rr*}  C  {A/j  =  1}  and  hence 
{A*  =  X*}  =  {Xh  =  1}  n  Q{d{h,  k))  n  {A*  =x*:j^R\  R{d{h,  A:))}.  This  yields  that 


jeR{k)^j  —  1  I  A  —  X  ] 


Pa[{yjeR{k)Xj  =  l}n  Q{d{k,h))  \Xh  =  l] 
Pa[Q{d{k,h))\Xh  =  l] 


(58) 


where  we  have  used  property  (M),  and  the  fact  that  P[A  \  B  C  D]  =  P[A  \  C  \  B]  when  A  and  C  are 
conditionally  independent  of  D  given  B. 
(59)


The  denominator  in  (58)  is  just  c^(^h,k)-  To  treat  the  numerator,  observe  that 

^a[{'^jeR(k)Xj  =  1}  n  Q{i)  I  Xf(i)  =  1] 

=  Pa[{^jeR(k)Xj  =  1}  n  Q{d{k,i))  n  {Xi  =  1}  njed(i)\d(k,i)  QU)  I  =  i] 

=  Pa[{^jeR(k)Xj  =  1}  n  Q{d{k,i))  I  X,  =  1  I  Xfd)  =  1]P„[X,  =  1  I  Xfd)  =  1]  n  P„[Q(j)  |  X,  |  Xf^^] 

jEd(i)\d(k,i) 

=  Pa[{'^jeR{k)Xj  =  1}  n  Q{d{k,i))  \  Xi  =  1]  Oj  cj. 

jEd(i)\d(k,i) 

Here  we  have  used  Q{j)  =  f^ied{j)Q {'>')’  the  Markov  property  (M),  and  the  fact  that  {Xf^  =  1}  C 
{Xj(^l.-^  =  1}.  Applying  (59)  repeatedly  to  the  numerator  of  (58),  we  obtain  the  form  jeR{k)^j  = 

1}  n  Q(k)  I  =  1]  Y[k^i<dik,h)  njed(i)\{d(i,A:)} 

The  first  term  is  in  this  product  is  Paii^ jeR{k)Xj  =  1}  n  Q{k)  \  =  1]  =  CkP a[{y jciR{k)X j  = 

1}  I  Q{k)  I  =  l]=ck{l-  Pa[yjeR{k)Xj  =  0  I  =  l]lck)  =  Ck-  bk,  and  the  stated  result  (35) 
follows.  ■ 


Proof  of  Lemma  1:  We  first  show  that  for  each  k  eV  there  is  some  x*  G  fig  with  m{x*)  >  0  for  which 
lk,a{x*)  >  0.  By  condition  (38)  there  is  x*  with  m{x*)  >  0,  for  which  either  rr*  =  1  or  rr*  =  u  for  some 
j  G  R{k).  In  the  former  case  %,a{x*)  =  1-  or  x*  =  u;  in  the  latter  case  it  is  not  hard  to  see  in  Theorem  5 
that  Ck  >  bk  and  hence  %,a{x*)  >  0.  Since  ^k,a{x*)  >  0  for  all  x*  G  fig,  then  %  q,  >  0. 

By  (40)  then  for  each  k  G  L\ii!  there  exists  x*  with  m{x*)  >  0  such  that  ^k,a{x*)  =  1,  and  children  j,  I 
of  k  for  which  =  1  while  %,a{x*)  >  0.  Hence  ^k,a{x*)  <  J2jed{k)  lj,a{x*)-  Since  by  definition 

<  T^jedik)  for  all  a;*,  we  have  <  Y.jed(k) 

Taking  these  relations  over  all  relevant  k,  we  conclude  that  ^  G  | 

Proof of Theorem 6: Reparameterizing the incomplete data likelihood function $\mathcal{L}$ in terms of $\gamma$, we obtain

\[ \begin{aligned} \mathcal{L} ={}& n(11)\log(\gamma_2 + \gamma_3 - \gamma_1) + n(10)\log(\gamma_1 - \gamma_3) + n(01)\log(\gamma_1 - \gamma_2) + n(00)\log(1 - \gamma_1) \\ &+ n(1u)\log(\gamma_2) + n(u1)\log(\gamma_3) + n(0u)\log(1 - \gamma_2) + n(u0)\log(1 - \gamma_3). \end{aligned} \]

Writing this form as $\mathcal{L} = \sum_{x^* \in \Omega^*} \mathcal{L}_{x^*}$ it suffices to show that $-\mathcal{L}_{x^*}$ is jointly convex in $\gamma_1, \gamma_2, \gamma_3$ for each $x^*$. This follows from the fact, established by direct computation, that the principal minors of the Hessian matrices $-[\partial^2 \mathcal{L}_{x^*}/\partial\gamma_i \partial\gamma_j]_{i,j=1,2,3}$ are non-negative. Convergence then follows from Theorem 4 and the standing assumptions (38) and (40). ■

Remark: the method used in the proof appears not to extend to more general trees, not even binary ones.
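The log-likelihood for the two-leaf tree is simple enough to evaluate directly. The following sketch codes the eight terms above and performs a midpoint check of concavity along one segment of the parameter space; the counts n(x*) and the two test points are hypothetical, and a single midpoint check is of course an illustration rather than a proof.

```python
import math

def incomplete_loglik(gamma, counts):
    """Log-likelihood above for the two-leaf tree; 'u' marks a missing report."""
    g1, g2, g3 = gamma
    probs = {
        "11": g2 + g3 - g1, "10": g1 - g3, "01": g1 - g2, "00": 1 - g1,
        "1u": g2, "u1": g3, "0u": 1 - g2, "u0": 1 - g3,
    }
    return sum(n * math.log(probs[x]) for x, n in counts.items() if n > 0)

# Hypothetical counts n(x*), chosen only for illustration.
counts = {"11": 50, "10": 8, "01": 6, "00": 4, "1u": 12, "u1": 11, "0u": 2, "u0": 3}

# Two feasible parameter points (all eight probabilities positive at each).
ga = (0.90, 0.80, 0.85)
gb = (0.95, 0.90, 0.88)
mid = tuple((a + b) / 2 for a, b in zip(ga, gb))
gap = incomplete_loglik(mid, counts) - 0.5 * (
    incomplete_loglik(ga, counts) + incomplete_loglik(gb, counts)
)
print(gap)  # non-negative when the function is concave along this segment
```

Each term is the logarithm of an affine function of the parameters, which is why the midpoint value dominates the average of the endpoint values.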


Proof  of  Theorem  7:  (i)  We  esfablish  fhaf  if  U  =  IJ^iUsi,  then  there  is  a  unique  solution  x.  We 

do  this  by  contradiction.  Suppose  that  there  exists  a  node  k  E  V  such  that  there  does  not  exist  a  unique 
value  for  $x_k$.  Pick  any  such  $\prec$-maximal  k.  By  assumption,  k  terminates  some  segment  Ks^ik)  and  hence
Xk  =  Psi  (k)  —  ^3-  maximality  assumption,  all  terms  on  the  RHS  are  unique,  and  hence 

so  is  Xk- 

We  establish  now  that  if  there  is  a  unique  solution  x,  then  U  =  done  by  contradiction. 

First,  note  if  the  solution  x  is  unique,  it  must  be  Xk  =  log  ak  ■  Second,  note  that  we  only  need  to  consider  a 
branch  point  k  G  U\R.  Assume  there  exists  k  ^  U^iUs^ \R-  Then  every  segment  containing  v  G  d{k)  also 
contains  k.  Using  Xk  =  log  ak  we  can  reduce  the  family  of  equations  {|3s^  =  Ds^x}'^-^^  to  the  following  set 
of  equations  describing  the  behavior  of  Xk  and  Xy,  v  G  d{k), 

Xk  +  Xy  =  Py,  Vn  G  d{k)  (60) 

where  /?'  is  the  log  of  the  probability  of  a  reception  of  a  packet  at  n  G  d{k)  given  that  it  was  received  at 
f{k).  The  jdy  are  determined  by  the  link  probabilities  other  than  and  Uy,  v  G  d{k),  and  the  original  Psi- 
However,  these  equations  do  not  have  a  unique  solution,  resulting  in  a  contradiction. 

(ii)  The  solution  x  to  {/J^.  =  ^  is  not  unique  if  and  only  if  there  is  more  that  one  set  of  link 

probabilities  a  giving  rise  to  the  same  and  hence  the  same  {Pa,0,Si}^i-  | 

Proof  of  Theorem  8:  With  MCAR  the  missingness  probabilities  do  not  depend  on  any  data,  and  so 
we  can  replace  6{x*)  with  0{t{x*)).  Since  J2x*:t{x*)=t'  P  ^  {0,1}^,  qa,0{x*)  = 

Qa',e'ix*),yx*  G  fi*  implies  6  =  6'. 

Consider  now  any  S  C  R  for  which  complete  data  is  available.  Then  9{t{:f))  are  equal  and  hence 
strictly  positive  for  any  x*  with  R{p,x*)  =  S.  Hence  qa,e{x*)  =  qa',e'{x*)  implies Pq, (rr* )  =  Pa'{x*)-  By 
Theorem  l(iv),  ag  is  identifiable.  Then  a  is  identifiable  by  Theorem  7(ii)  since  0  G  ©c.  | 


Proof of Theorem 9: The proof mirrors that of Theorem 3(ii) in [2] which in turn uses Lemma 7.54 in [19]. Although not mentioned there, the latter result requires identifiability. This follows from the MCAR assumption and Theorem 8. ■


Proof of Theorem 10: We first show that $\mathcal{I}_S$ is positive definite if complete data is available from $S$. By standard arguments (see e.g. Prop 2.84 in [19]) one writes

If  X^{ag)  is  nol  positive  definite,  Ihere  exisls  some  nonzero  c  G  for  which  c  •  Ig{ag)  ■  c  =  VarQ,^0(c  • 
~  happens  if  almosl  surely,  or,  equivalenlly,  if  _ 

0  for  all  x*^,  since  6{x*)  >  0  by  assumption.  Repeating  fhe  argumenl  of  Theorem  4(ii)  in  [2],  Ibis  requires 
c  =  0. 

Observe  lhal  =  B{ag)DgB~^{a)  where  B{a)  is  Ihe  diagonal  malrix  wilh  enlries  from  a.  Thus 
X  >  J2seSe  ^~^{^)^^^{^s)Is{ois)B{ag)DgB~^{a),  in  Ihe  order  of  positive  linear  operators.  Since 
the kernel of a sum of non-negative definite operators is the intersection of their kernels, the sum is positive definite iff $\bigcap_{S \in \mathcal{S}_\theta} \ker(D_S) = \{0\}$, which happens iff the equations $\{\beta_S = D_S x\}_{S \in \mathcal{S}_\theta}$ have a unique solution $x$, which is guaranteed by Theorem 7(i) and the assumption that $\theta \in \Theta_c$.
(ii)LetS'  G  Sg.  Note  as{k)/ai  =  Ds,ki{'^  +  0{\\a\\)).  With  data  MCAR,  the  0(rE*)  are  equal  for  any  rr*
with  rr*)  =  S,  and  henee  Em{x*)  =  Nsp^ix*)  =  (rr^).  From  Theorem  5  of  [2],  Ng^Is{as) 

has  inverse  i^s  for  whieh  vf  =  as{k)8ki  +  Odla^f ).  Henee  Xf{as)  =  +  0(||a5||).  Finally, 

observe  that  =  Ws{i)  +  0(||a|p).  Putting  these  together  with  (48)  the  result  follows.  | 


Proof  of  Proposition  1:  Rewrite  C  =  J2f=iP^Cg  where  C'  =  J2i<t<s^t  ■  (—1)*”*  (^It)  ■  Observe  that 
the  definition  of  Kg  is  invariant  under  replaeement  of  Cg  by  C'  in  (50)  sinee  PKi+...+Ks-iCgi  Pki+...+Ks-i  = 
0  for  s'  <  s.  Let  Wg  denote  the  unitary  matrix  that  diagonalizes  Kg  and  set  W  =  01=1  ^s-  (Sinee  the 
ranges  of  the  Kg  are  disjoint,  the  various  Wg  eommute).  Then  up  to  unitary  transformation  under  W,  we 
ean  write  C  is  bloek  diagonal  form 

/  pAu{p)  p'^Auip)  •••  p''°Airo{p)  \ 

^  P^A22{p)  •••  P’’°A2^o(p) 

C  =  W  .  .  .  W  (61) 

V  P’’°Aol(p)  P’’°Ao2(p)  •••  P’’°Aoro(p)  / 


where  Aij{p)  =  Aj-{p),  and  eaeh  submatrix  Aij  eonverges  to  some  Aij  as  p  — )■  0  sueh  that  the  bloek 
matrix  with  Agg  as  the  diagonal  element  and  zero  elsewhere  is  unitarily  equivalent  to  Kg  under  W . 
By  eonstruetion,  the  Agg  are  invertible,  and  henee  so  are  the  Agg{p)  for  suffieiently  small  p.  The  bloek 
diagonal  representation  of  C  ean  be  inverted  induetively  as  follows.  Assume  the  bloek  submatrix 
eomprising  the  first  s  —  1  bloeks  ean  be  inverted  and  that  is  0(p^“*)  as  p  — )■  0.  This  eondition 

is  trivially  satisfied  for  s  =  2.  Now  write  .B[s][s]  as  a  two-by-two  superbloek  matrix 


Bgig_i]  p"Agg{p)  ) 


Then  =  Dg^iip)  ■  Dg^{p)  where 


(62) 


Ds,i{p) 


Ds,2{p) 


y  p-Mj/(p)  J 

(  1— p  0 


(63) 


Now  g  =  0(p*)  as  p  — )■  0,  from  whieh  it  follows  that 


Consequently,  B7\  =  0(p~'^)  as  p  — )■  0,  eompleting  the  induetion  step.  The  statement  of  the  Proposition 

[55J 

then  follows  from  (65)  by  taking  s  =  rg.  | 


Proof  of  Proposition  2:  (i)  In  the  given  model.  Ns  >  0  for  all  subsets  S  of  R,  and  henee  I  >l2.  Sinee 

eaeh  node  inU  C  R  has  at  least  two  deseendent  leaves  U  =  Ll^s=2Us  and  henee  X2  >  0  by  an  argument 
similar to that in Theorem 7(i). Now if $A > B > 0$ then $0 < 1 - a < 1 - b < 1$ where $a = A/(2\|A\|)$ and
b  =  BI{2\\A\\).  Sinee  ||l-a||,  ||1-6||  <  1,  a'l  =  (l-(l-a))-i  =  E“o(l-«)*  < 
whenee  the  result  follows. 

(ii)  Consider  inferenee  performed  by  using  only  measurements  from  binary  sets  3.%=  pQi+p^Q2  for 
some  Qi  and  Q2  independent  ofp,  and  henee  is  0{p~‘^).  By  (i),  whieh  preeludes  ro  >  2 

in  Proposition  1.  We  eonelude  ro  =  2  by  showing  that  Ki  has  0  as  an  eigenvalue,  for  then  ro  >  1.  Assume 
that  the  root  p  as  a  unique  ehild  1.  If  not,  partition  the  T  into  disjoint  subtrees  with  nodes  deseended  from 
eaeh  ehild  of  p,  then  apply  the  following  argument  to  eaeh  subtree.  Let  v  denote  the  element  of  l¥  with 
r(l)  =  1,  v{j)  =  —1  for  j  G  d(l),  and  v{k)  =  0  otherwise.  Observe  that  for  eaeh  k  E  R,  the 
equal  for  i,j  G  {1,  d{k,  1)}.  Sinee  Ci  =  J2keR  Ci  •  r  =  0.  | 

Proof  of  Proposition  4:  max£’(VAd)  =  1  —  amin£’(C')  +  0{a)).  From  Propositions  1  and2  weknow 
that  C  takes  the  bloek  diagonal  form  (61)  with  ro  =  2.  From  this  is  follows  that  eaeh  eigenvalue  of  C 
takes  the  formp*rj(p)  +  for  some  i  G  {1,2},  where  \imp^QVi{p)  G  £’(Ajj).  Sinee  0  0  £’(Ajj), 

p~‘^  min £’((7)  — )■  min£’(A22)  asp  — )■  0,  and  henee  the  result  follows.  | 
Abstract

Packet delay greatly influences the overall performance of network applications. It is therefore important to identify causes and location of delay performance degradation within a network. Existing techniques, largely based on end-to-end delay measurements of unicast traffic, are well suited to monitor and characterize the behavior of particular end-to-end paths. Within these approaches, however, it is not clear how to apportion the variable component of end-to-end delay as queueing delay at each link along a path. Moreover, they suffer from scalability issues if a significant portion of a network is of interest.

In this paper, we show how end-to-end measurements of multicast traffic can be used to infer the packet delay distribution and utilization on each link of a logical multicast tree. The idea, recently introduced in [4, 5], is to exploit the inherent correlation between multicast observations to infer performance of paths between branch points in a tree spanning a multicast source and its receivers. The method does not depend on cooperation from intervening network elements; because of the bandwidth efficiency of multicast traffic, it is suitable for large scale measurements of both end-to-end and internal network dynamics. We establish desirable statistical properties of the estimator, namely consistency and asymptotic normality. We evaluate the estimator through simulation and observe that it is robust with respect to moderate violations of the underlying model.

Keywords.  End-to-end  measurements,  queueing  delay,  estimation  theory,  multicast  tree,  network 
tomography. 

*This work was sponsored in part by the DARPA and the Air Force Research Laboratory under agreement F30602-98-2-0238.

1 Introduction


Background  and  Motivation.  Monitoring  the  performance  of  large  communications  networks 
is  essential  for  diagnosing  the  causes  of  performance  degradation.  There  are  two  broad  approaches 
to  monitoring.  In  the  internal  approach,  direct  measurements  are  made  at  or  between  network 
elements,  e.g.  of  packet  loss  or  delay.  In  the  external  approach,  measurements  are  made  across  a 
network  on  end-to-end  or  edge-to-edge  paths. 

The internal approach has a number of potential limitations. Due to the commercial sensitivity of performance measurements, and the potential load incurred by the measurement process, it is expected that measurement access to network elements will be limited to service providers and, possibly, selected peers and users. The internal approach assumes sufficient coverage, i.e. that measurements can be performed at all relevant elements on paths of interest. In practice, not all elements may possess the required functionality, or it may be disabled at heavily utilized elements in order to reduce CPU load. On the other hand, arranging for complete coverage of larger networks raises issues of scale, both in the gathering of measurement data, and in joining data collected from a large number of elements in order to form a composite view of end-to-end performance.

This motivates external approaches, network diagnosis through end-to-end measurements, without necessarily assuming the cooperation of network elements on the path. There has been much recent experimental work to understand the phenomenology of end-to-end performance (e.g., see [3, 9, 19, 26, 27, 29]). Several research efforts are working on the development of measurement infrastructure projects (Felix [13], IPMA [15], NIMI [18] and Surveyor [35]) with the aim to collect and analyze end-to-end measurements across a mesh of paths between a number of hosts. Standard diagnostic tools for IP networks, ping and traceroute, report roundtrip loss and delay, the latter incrementally along the IP path by manipulating the time-to-live (TTL) field of probe packets. A recent refinement of this approach, pathchar [17], estimates hop-by-hop link capacities, packet delay and loss rates. pathchar is still under evaluation; initial experience indicates many packets are required for inference, leading to either high load of measurement traffic or long measurement intervals, although adaptive approaches can reduce this [10]. More broadly, measurement approaches based on TTL expiry require the cooperation of network elements in returning Internet Control Message Protocol (ICMP) messages. Finally, the success of active measurement approaches to performance diagnosis may itself cause increased congestion if intensive probing techniques are widely adopted.

In response to some of these concerns, a multicast-based approach to active measurement has been proposed recently in [4, 5]. The idea behind the approach is that correlation in performance seen on intersecting end-to-end paths can be used to draw inferences about the performance characteristics of the common portion (the intersection) of the paths, without the cooperation of network elements on the path. Multicast traffic is particularly well suited for this since a given packet only occurs once on a given link in the (logical) multicast tree. Thus characteristics such as loss and end-to-end delay of a given multicast packet as seen at different endpoints are highly correlated. Another advantage of using multicast traffic is scalability. Suppose packets are exchanged on a mesh of paths between a collection of $N$ measurement hosts stationed in a network. If the packets are unicast, then the load on the network may grow proportionally to $N^2$ in some parts of the network, depending on the topology. For multicast traffic the load grows proportionally only to $N$.

Contribution The work of [4, 5] showed how multicast end-to-end measurements can be used to
infer  per  link  loss  rates  in  a  logical  multicast  tree.  In  this  paper  we  extend  this  approach  to  infer  the 
probability  distribution  of  the  per  link  variable  delay.  Thus  we  are  not  concerned  with  propagation 
delay  on  a  link,  but  rather  the  distribution  of  the  additional  variable  delay  that  is  attributable  to 
either  queuing  in  buffers  or  other  processing  in  the  router.  A  key  part  of  the  method  is  an  analysis 
that  relates  the  probabilities  of  certain  events  visible  from  end-to-end  measurements  (end-to-end 
delays)  to  the  events  of  interest  in  the  interior  of  the  network  (per-link  delays).  Once  this  relation 
is  known,  we  can  estimate  the  delay  distribution  on  each  link  from  the  measured  distributions  of 
end-to-end  delays  of  multicast  packets. 

For  a  glimpse  of  how  the  relations  between  end-to-end  delay  and  per  link  delays  could  be 
found,  consider  a  multicast  tree  spanning  a  source  of  multicast  probes  (identified  as  the  root  of  the 
tree)  and  a  set  of  receivers  (one  at  each  leaf  of  the  tree).  We  assume  the  packets  are  potentially 
subject  to  queuing  delay  and  even  loss  at  each  link.  Focus  on  a  particular  node  k  in  the  interior  of 
the  tree.  If,  for  a  given  packet,  the  source-to-leaf  delay  does  not  exceed  a  given  value  on  any  leaf 
descended  from  k,  then  clearly  the  delay  from  the  root  to  the  node  k  was  less  than  that  value.  The 
stated  desired  relation  between  the  distributions  of  per-link  and  source-to-leaf  delays  is  obtained 
by  a  careful  enumeration  of  the  different  ways  in  which  end-to-end  delay  can  be  split  between  the 
portion of the path above or below the node in question, together with the assumption that
per-link delays are independent between different links and packets. We shall comment later upon the
robustness  of  our  method  to  violation  of  this  independence  assumption. 

We model link delay by non-parametric discrete distributions. The choice of non-parametric
distributions rather than a parameterized delay model is dictated by the lack of knowledge of
the distribution of link delays in networks. While there is significant prior work on the analysis
and characterization of end-to-end delay behavior (see [2, 24, 27]), to the best of our knowledge
there is no general model for per link delays. The use of a non-parametric model provides the
flexibility  to  capture  broadly  different  delay  distributions,  albeit  at  the  cost  of  increasing  the  number 
of  quantities  to  estimate  (i.e.  the  weights  in  the  discrete  distribution).  Indeed,  we  believe  that  our 
inference  technique  can  shed  light  on  the  behavior  and  dynamics  of  per  link  delays  and  so  provide 
useful  results  for  the  analysis  and  modeling;  this  we  will  consider  in  future  work. 

The discrete distribution can be regarded as a binned or discretized version of the (possibly
continuous)  true  delay  distribution.  Use  of  a  discrete  rather  than  a  continuous  distribution  allows 
us  to  perform  the  calculations  for  inference  using  only  algebra.  Formally,  there  is  no  difficulty  in 
formulating  a  continuous  version  of  the  inference  algorithm.  However,  it  proceeds  via  inversion 
of  Laplace  transforms,  a  procedure  that  is  in  practice  implemented  numerically.  In  the  discrete 
approach we can explicitly trade off the detail of the distribution against the cost of calculation; the
cost is inversely proportional to the bin widths of the discrete distribution.

The principal results of the analysis are as follows. Based on the independent delay model,
we  derive  an  algorithm  to  estimate  the  per  link  discrete  delay  distributions  and  utilization  from  the 
measured  end-to-end  delay  distributions.  We  investigate  the  statistical  properties  of  the  estimator, 
and  show  it  to  be  strongly  consistent,  i.e.,  it  converges  to  the  true  distribution  as  the  number  of 
probes  grows  to  infinity.  We  show  that  the  estimator  is  asymptotically  normal;  this  allows  us  to 
compute  the  rate  of  convergence  of  the  estimator  to  its  true  value,  and  to  construct  confidence 
intervals  for  the  estimated  distribution  for  a  given  number  of  probes.  This  is  important  because  the 
presence  of  large  scale  routing  fluctuation  (e.g.  as  seen  in  the  Internet;  see  [26])  sets  a  timescale 
within  which  measurement  must  be  completed,  and  hence  the  accuracy  that  can  be  obtained  when 
sending  probes  at  a  given  rate. 

We  evaluated  our  approach  through  extensive  simulation  in  two  different  settings.  The  first  set 
used  a  model  simulation  in  which  packet  delays  obey  the  independence  assumption  of  the  model. 
We applied the inference algorithm to the end-to-end delays generated in the simulation and
compared the inferred distribution with the (true) model delay distribution. We verified the convergence to the model distribution,
and  also  the  rate  of  convergence,  as  the  number  of  probes  increased. 

In  the  second  set  of  experiments  we  conducted  an  ns  simulation  of  packets  on  a  multicast  tree. 
Packet  delays  and  losses  were  entirely  due  to  queueing  and  packet  discard  mechanisms,  rather  than 
model  driven.  The  bulk  of  the  traffic  in  the  simulations  was  background  traffic  due  to  TCP  and 
UDP  traffic  sources;  we  compared  the  actual  and  predicted  delay  distributions  for  the  probe  traffic. 
Here  we  found  rapid  convergence,  although  with  some  persistent  differences  with  respect  to  the 
actual  distributions. 

These differences appear to be caused by violation of the model due to the presence of
spatial dependence (i.e., dependence between delays on different links). In our simulations we find
that  when  this  type  of  dependence  occurs,  it  is  usually  between  the  delays  on  child  and  parent 
links.  However,  it  can  extend  to  entire  paths.  As  far  as  we  know  there  are  no  experimental  results 
concerning the magnitude of such dependence in real networks. In any case, by explicitly
introducing spatial correlations into the model simulations, we were able to show that small violations
of  the  independence  assumption  lead  to  only  small  inaccuracies  of  the  estimated  distribution.  This 
continuity  property  of  the  deformation  in  inference  due  to  correlations  is  also  to  be  expected  on 
theoretical  grounds. 

We also verified the presence of temporal dependence, i.e., dependence between the delays
of successive probes on the same link. This is to be expected from the phenomenology of
queueing: when a node is idle, many consecutive probes can experience constant delay; during
congestion, probes can experience the same delay if their interarrival time is smaller than the
congestion timescale. This poses no difficulty as all that is required for consistency of the estimator is
ergodicity  of  the  delay  process,  a  far  weaker  assumption  than  independence.  However,  dependence 
can  decrease  the  rate  of  convergence  of  the  estimators.  In  our  experiments,  inferred  values  closely 
tracked  the  actual  ones  despite  the  presence  of  temporal  dependence. 

Implementation  Requirements  Since  the  data  for  delay  inference  comprises  one-way  packet 
delays,  the  primary  requirement  is  the  deployment  of  measurement  hosts  with  synchronized  clocks. 
Global  Positioning  System  (GPS)  systems  afford  one  way  to  achieve  a  synchronization  to  within 
tenths  of  microseconds;  it  is  currently  used  or  planned  in  several  of  the  measurement  infrastructures 
mentioned  earlier.  More  widely  deployed  is  the  Network  Time  Protocol  (NTP)  [20] .  However,  this 
provides  accuracy  only  on  the  order  of  milliseconds  at  best,  a  resolution  at  least  as  coarse  as  the 
queueing  delays  in  practice.  An  alternative  approach  that  could  supplement  delay  measurement 
from  unsynchronized  or  coarsely  synchronized  clocks  has  been  developed  in  [28,  30,  21].  These 
authors  propose  algorithms  to  detect  clock  adjustments  and  rate  mismatches  and  to  calibrate  the 
delay  measurements. 

Another requirement is knowledge of the multicast topology. There is a multicast-based
measurement tool, mtrace [23], already in use in the Internet. mtrace reports the route from a
multicast  source  to  a  receiver,  along  with  other  information  about  that  path  such  as  per-hop  loss 
and  rate.  Presently  it  does  not  support  delay  measurements.  A  potential  drawback  for  larger 
topologies  is  that  mtrace  does  not  scale  to  large  numbers  of  receivers  as  it  needs  to  run  once  for 
each  receiver  to  cover  the  entire  multicast  tree.  In  addition,  mtrace  relies  on  multicast  routers 
responding to explicit measurement queries, a feature that can be administratively disabled. An
alternative approach that is closely related to the work on multicast-based loss inference [4, 5] is to
infer  the  logical  multicast  topology  directly  from  measured  probe  statistics;  see  [31]  and  [7].  This 
method  does  not  require  cooperation  from  the  network. 

Structure of the Paper. The remaining sections of the paper are organized as follows. In
Section 2 we describe the delay model and in Section 3 we derive the delay estimator. In Section 4 we
describe  the  algorithm  used  to  compute  the  estimator  from  data.  In  Section  5  we  present  the  model 
and  network  simulations  used  to  evaluate  our  approach.  Section  6  concludes  the  paper. 

2  Model  &  Framework 

2.1  Description  of  the  Logical  Multicast  Tree 

We  identify  the  physical  multicast  tree  as  comprising  actual  network  elements  (the  nodes)  and  the 
communication links that join them. The logical multicast tree comprises the branch points of the
physical  tree,  and  the  logical  links  between  them.  The  logical  links  comprise  one  or  more  physical 
links.  Thus  each  node  in  the  logical  tree,  except  for  the  leaf  nodes  and  possibly  the  root,  must 
have 2 or more children. We can construct the logical tree from the physical tree by deleting all
nodes with one child (except for the root) and adjusting the links accordingly by directly joining each
deleted node's parent and child.
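
The pruning step just described can be sketched in Python. This is only an illustration under assumptions made here: the tree is stored as a hypothetical `parent` dictionary (child -> parent) with root 0, and the function name `logical_tree` does not come from the paper.

```python
from collections import defaultdict

def logical_tree(parent, root=0):
    """Drop every non-root node with exactly one child and reconnect
    its child to the nearest surviving ancestor."""
    children = defaultdict(list)
    for node, par in parent.items():
        children[par].append(node)

    def keep(node):
        # The root, the leaves and the true branch points survive.
        return node == root or len(children[node]) != 1

    logical = {}
    for node in parent:
        if not keep(node):
            continue
        anc = parent[node]
        while not keep(anc):          # skip over single-child relay nodes
            anc = parent[anc]
        logical[node] = anc
    return logical

# Physical tree: 0 -> 1 -> 2 -> {3, 4}, and 4 -> 5 -> 6
physical = {1: 0, 2: 1, 3: 2, 4: 2, 5: 4, 6: 5}
print(logical_tree(physical))
```

In this example nodes 1, 4 and 5 are relays, so the logical tree keeps only the root, the branch point 2 and the receivers 3 and 6.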

Let $T = (V, L)$ denote the logical multicast tree, consisting of the set of nodes $V$, including
the source and receivers, and the set of links $L$, which are ordered pairs $(j, k)$ of nodes, indicating
a link from $j$ to $k$. We will denote $U = V \setminus \{0\}$. The set of children of node $j$ is denoted by
$d(j)$; these are the nodes whose parent is $j$. Nodes are said to be siblings if they have the same
parent. For each node $j$, other than the root 0, there is a unique node $f(j)$, the parent of $j$, such
that $(f(j), j) \in L$. Each link can therefore also be identified by its "child" endpoint. We shall
define $f^n(k)$ recursively by $f^n(k) = f(f^{n-1}(k))$ with $f^1 = f$. We say that $j$ is a descendant of
$k$ if $k = f^n(j)$ for some integer $n > 0$, and write the corresponding partial order in $V$ as $j \prec k$.
For each node $j$ we define its level $\ell(j)$ to be the non-negative integer such that $f^{\ell(j)}(j) = 0$. The
root $0 \in V$ represents the source of the probes and the set of leaf nodes $R \subset V$ (i.e., those with no
children) represents the receivers.

2.2 Modeling Delay and Loss of Probe Packets

Probe packets are sent down the tree from the root node 0. Each probe that arrives at node $k$ results
in a copy being sent to every child of $k$. We associate with each node $k$ a random variable $D_k$ taking
values in the extended positive real line $\mathbb{R}_+ \cup \{\infty\}$. By convention $D_0 = 0$. $D_k$ is the random
delay that would be encountered by a packet attempting to traverse the link $(f(k), k) \in L$. The
value $D_k = \infty$ indicates that the packet is lost on the link. We assume that the $D_k$ are independent.

The delay experienced on the path from the root 0 to a node $k$ is $Y_k = \sum_{j \succeq k} D_j$. Note that $Y_k = \infty$
iff $D_j = \infty$ for some $j \succeq k$, i.e., if the packet was lost on some link between node 0 and $k$.
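
As an illustration of this model, the following Python sketch draws one probe: an independent delay on every link, then the path delay to each node as the sum of link delays, with loss represented by infinity. The names `sample_probe`, `parent` and `alpha` are assumptions made for this example, and delays are expressed as integer multiples of an arbitrary unit.

```python
import math
import random

def sample_probe(parent, alpha, rng, root=0):
    """Return {k: Y_k} for one probe on the tree given by parent[k] = f(k).

    alpha[k][i] is the probability that link k adds delay i; any
    probability mass missing from alpha[k] is treated as loss (D_k = inf).
    """
    D = {}
    for k, weights in alpha.items():
        u = rng.random()
        acc = 0.0
        D[k] = math.inf               # lost unless a finite delay is drawn
        for i, w in enumerate(weights):
            acc += w
            if u < acc:
                D[k] = i
                break

    Y = {root: 0}
    def path_delay(k):
        # Y_k is the sum of the link delays from the root down to k.
        if k not in Y:
            Y[k] = path_delay(parent[k]) + D[k]
        return Y[k]

    return {k: path_delay(k) for k in parent}

tree = {1: 0, 2: 1, 3: 1}
# Degenerate distributions: link 1 always delays by 1, link 2 never delays,
# link 3 always loses the packet.
probe = sample_probe(tree, {1: [0.0, 1.0], 2: [1.0], 3: [0.0, 0.0]},
                     random.Random(0))
```

With these degenerate distributions the outcome does not depend on the random seed: nodes 1 and 2 see delay 1 and node 3 sees an infinite delay, because a loss on any link of the path makes the whole path delay infinite.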

Unless otherwise stated, we will discretize each link delay $D_k$ to a set $\{0, q, 2q, \ldots, i_{\max} q, \infty\}$.
Here $q$ is the bin width, $i_{\max} + 1$ is the number of bins, and the point $\infty$ is interpreted as "packet
lost" or "encountered delay greater than $i_{\max} q$". The distribution of $D_k$ is denoted by $\alpha_k$, where
$\alpha_k(i) = P[D_k = iq]$ with $\alpha_k(\infty)$ the probability that $D_k = \infty$. For each link, we denote by $u_k$ the
link utilization; then $u_k = 1 - \alpha_k(0)$, the probability that a packet experiences delay or is lost in
traversing link $k$.

For each $k \in V$, the cumulative delay process $Y_k$ takes values in $\{0, q, 2q, \ldots, i_{\max} q\,\ell(k), \infty\}$,
i.e., it supports addition in the ranges of the constituent $D_j$. We set $A_k(i) = P[Y_k = iq]$ with
$A_k(\infty)$ the probability that $Y_k = \infty$. Because of delay independence, for finite $i$,
$A_k(i) = \sum_{j=0}^{i} A_{f(k)}(j)\,\alpha_k(i-j)$, $k \in U$; by convention $A_0(0) = 1$.
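
The convolution relation for the path delay distribution can be written as a few lines of Python. This is a sketch with hypothetical names (`path_distribution`, `parent`, `alpha`), using plain lists indexed by bin number; any probability mass that is missing from a list corresponds to loss.

```python
def path_distribution(parent, alpha, k, root=0):
    """A_k as the discrete convolution of A_{f(k)} with alpha_k; A_root = [1.0]."""
    if k == root:
        return [1.0]
    A_par = path_distribution(parent, alpha, parent[k], root)
    a = alpha[k]
    out = [0.0] * (len(A_par) + len(a) - 1)
    for j, pj in enumerate(A_par):
        for i, ai in enumerate(a):
            out[j + i] += pj * ai      # delay j above k plus delay i on link k
    return out

chain = {1: 0, 2: 1}
link_dist = {1: [0.5, 0.5], 2: [0.5, 0.25]}   # link 2 loses 25% of the packets
A2 = path_distribution(chain, link_dist, 2)
```

Here `A2` sums to 0.75 rather than 1: the remaining 0.25 is the probability that the probe is lost before reaching node 2.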

We consider only canonical delay trees. A delay tree consists of the pair $(T, \alpha)$, $T = (V, L)$,
$\alpha = (\alpha_k(i))_{k \in U,\, i \in \{0, \ldots, i_{\max}\}}$. A delay tree is said to be canonical if $\alpha_k(0) > 0$, $\forall k \in U$, i.e., if there
is a non-zero probability that a probe experiences no delay in traversing each link.

3  Delay  Distribution  Estimator  and  its  Properties 

Consider an experiment in which $n$ probes are sent from the source node down the multicast tree.
As a result of the experiment we collect the set of source-to-leaf delays $(Y_{k,i})_{k \in R,\, i = 1, \ldots, n}$. Our goal is
to infer the internal delay characteristics solely from the collected end-to-end measurements.

In  this  section  we  state  the  main  analytic  results  on  which  inference  is  based.  In  Section  3.1 
we  establish  the  key  property  underpinning  our  delay  distribution  estimator,  namely  the  one-to- 
one  correspondence  between  the  link  delay  distributions  and  the  probabilities  of  a  well  defined  set 
of  observable  events.  Applying  this  correspondence  to  measured  leaf  delays  allows  us  to  obtain 
an  estimate  of  the  link  delay  distribution.  We  show  that  the  estimator  is  strongly  consistent  and 
asymptotically normal. In Section 3.2 we present the proof of the main result, which also provides
the construction of the algorithm to compute the estimator that we present in Section 4. In Section 3.4
we analyze the rate of convergence of the estimator as the number of probes increases.

3.1  The  Delay  Distribution  Estimator 

Let $T(k) = (V(k), L(k))$ denote the subtree rooted at node $k$ and $R(k) = R \cap V(k)$ the set
of receivers which descend from $k$. Let $\Omega_k(i)$ denote the event $\{\min_{j \in R(k)} Y_j \le iq\}$ that the
end-to-end delay is no greater than $iq$ for at least one receiver in $R(k)$. Let $\gamma_k(i) =
P[\Omega_k(i)]$ denote its probability. Finally let $\Gamma$ denote the mapping associating the link
distributions $\alpha = (\alpha_k(i))_{k \in U,\, i \in \{0, \ldots, i_{\max}\}}$ to the probabilities of the events $\Omega_k(i)$,
$\gamma = (\gamma_k(i))_{k \in U,\, i \in \{0, \ldots, i_{\max}\}}$. The proof of the next result is given in the following section.

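Theorem 1 Let $\mathcal{A} = \{\alpha = (\alpha_k(i))_{k \in U,\, i \in \{0, \ldots, i_{\max}\}} : \alpha_k(0) > 0,\ \sum_{i \le i_{\max}} \alpha_k(i) \le 1\}$ and
$\mathcal{G} = \{\gamma = (\gamma_k(i))_{k \in U,\, i \in \{0, \ldots, i_{\max}\}} : \exists \alpha \in \mathcal{A},\ \gamma = \Gamma(\alpha)\}$. $\Gamma$ is a bijection from $\mathcal{A}$ to $\mathcal{G}$ which is
continuously differentiable and has a continuously differentiable inverse.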

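Estimate $\gamma$ by the empirical probabilities $\hat\gamma$, where

$\hat\gamma_k(i) = \frac{1}{n} \sum_{m=1}^{n} \mathbf{1}\{\hat Y_{k,m} \le iq\}$, (1)

$\mathbf{1}\{S\}$ denotes the indicator function of the set $S$ and $\hat Y_{k,m}$ are the subsidiary quantities

$\hat Y_{k,m} = \min_{d \in R(k)} Y_{d,m}$. (2)

Our estimate of $\alpha_k(i)$ is $\hat\alpha_k(i) = (\Gamma^{-1}(\hat\gamma))_k(i)$. We estimate the utilization of link $k$ by $\hat u_k = 1 - \hat\alpha_k(0)$.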
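
A minimal Python sketch of the empirical probabilities (1) and (2) follows. The data layout is an assumption made for the example: `leaf_delays[d]` holds the measured delay of each probe at receiver `d` (with `math.inf` for a lost probe), and `receivers_below[k]` lists the receivers in $R(k)$; neither name comes from the paper.

```python
import math

def empirical_gamma(leaf_delays, receivers_below, q, i_max):
    """gamma_hat[k][i]: fraction of probes whose smallest delay over the
    receivers below k is at most i*q, following (1) and (2)."""
    n = len(next(iter(leaf_delays.values())))
    gamma_hat = {}
    for k, leaves in receivers_below.items():
        # Subsidiary quantity (2): per-probe minimum over the receivers in R(k).
        y_min = [min(leaf_delays[d][m] for d in leaves) for m in range(n)]
        gamma_hat[k] = [sum(y <= i * q for y in y_min) / n
                        for i in range(i_max + 1)]
    return gamma_hat

delays = {2: [0, 1, math.inf, 2], 3: [1, 0, math.inf, math.inf]}
below = {1: [2, 3], 2: [2], 3: [3]}
g = empirical_gamma(delays, below, q=1, i_max=1)
```

For the interior node 1 the first two probes reach at least one receiver with zero delay, so `g[1][0]` is 0.5, while each leaf on its own sees a zero delay only once in four probes.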

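Let $\mathcal{A}^{(1)} = \{\alpha = (\alpha_k(i))_{k \in U,\, i \in \{0, \ldots, i_{\max}\}} : \alpha_k(0) > 0,\ \sum_{i \le i_{\max}} \alpha_k(i) < 1\}$ denote the open
interior of $\mathcal{A}$. The following holds:

Theorem 2 When $\gamma \in \Gamma(\mathcal{A}^{(1)})$, as $n \to \infty$, $\hat\alpha = \Gamma^{-1}(\hat\gamma)$ converges almost surely to $\alpha$, i.e., the
estimator is strongly consistent.

Proof: Since $\Gamma^{-1}$ is continuous on $\mathcal{G}$ and $\mathcal{A}^{(1)}$ is open in $\mathcal{A}$, it follows that $\Gamma(\mathcal{A}^{(1)})$ is an
open set in $\Gamma(\mathcal{A})$. By the Strong Law of Large Numbers, since $\hat\gamma$ is the mean of $n$ independent
random variables, $\hat\gamma$ converges to $\gamma$ almost surely for $n \to \infty$. Therefore, when $\gamma \in \Gamma(\mathcal{A}^{(1)})$, there
exists $n_0$ such that $\hat\gamma \in \Gamma(\mathcal{A}^{(1)})$ for $n > n_0$. Then the continuity of $\Gamma^{-1}$ ensures that $\hat\alpha$ converges
almost surely to $\alpha$ as $n \to \infty$. ■

3.2 Proof of Theorem 1

To prove the Theorem, we first express $\gamma$ as a function of $\alpha$ and then show that the mapping from $\mathcal{A}$
to $\mathcal{G}$ is injective.

3.2.1 Relating $\gamma$ to $\alpha$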


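Denote $\beta_k(i) = P[\Omega_k(i) \mid Y_{f(k)} = 0]$, $k \in U$, $i = 0, \ldots, i_{\max}$. $\beta_k(i)$
obeys the recursion

$\beta_k(i) = \sum_{j=0}^{i} \alpha_k(j) \left\{ 1 - \prod_{d \in d(k)} (1 - \beta_d(i-j)) \right\}$, $k \in U \setminus R$,
$\beta_k(i) = \sum_{j=0}^{i} \alpha_k(j)$, $k \in R$. (3)

Then, by observing that

$\gamma_k(i) = \sum_{j=0}^{i} \beta_k(i-j) A_{f(k)}(j)$, $k \in U \setminus R$, (4)

we readily obtain

$\gamma_k(i) = \sum_{j=0}^{i} A_k(j) \left\{ 1 - \prod_{d \in d(k)} (1 - \beta_d(i-j)) \right\}$, $k \in U \setminus R$,
$\gamma_k(i) = \sum_{j=0}^{i} A_k(j)$, $k \in R$. (5)

The set of equations (5) completely identifies the mapping $\Gamma$ from $\mathcal{A}$ to $\mathcal{G}$. The mapping is clearly
continuously differentiable. Observe that the above expressions can be regarded as a generalization
of those derived for the loss estimator in [4] (by identifying the event no delay with the event no
loss).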
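
To make the forward mapping concrete, here is a Python sketch that evaluates the path distributions, the recursion (3) and relation (4) on a small tree. It is an illustration only: the function name `forward_map`, the `parent` dictionary and the list-based distributions are assumptions, and every `alpha[k]` is assumed to have exactly `i_max + 1` entries.

```python
def forward_map(parent, alpha, i_max, root=0):
    """Return (A, beta, gamma) computed from the link distributions alpha."""
    children = {}
    for k, p in parent.items():
        children.setdefault(p, []).append(k)

    def conv(x, y, i):
        # i-th term of the discrete convolution of the sequences x and y
        return sum(x[j] * y[i - j] for j in range(i + 1))

    A = {root: [1.0] + [0.0] * i_max}
    beta, gamma = {}, {}

    def visit(k):
        A[k] = [conv(A[parent[k]], alpha[k], i) for i in range(i_max + 1)]
        kids = children.get(k, [])
        for d in kids:
            visit(d)
        # hit[i] = P[some receiver below k has delay <= i*q | delay to k is 0];
        # for a leaf this is 1, which gives the second line of (3).
        hit = []
        for i in range(i_max + 1):
            miss = 1.0
            for d in kids:
                miss *= 1.0 - beta[d][i]
            hit.append(1.0 - miss if kids else 1.0)
        beta[k] = [conv(alpha[k], hit, i) for i in range(i_max + 1)]            # (3)
        gamma[k] = [conv(A[parent[k]], beta[k], i) for i in range(i_max + 1)]   # (4)

    for d in children.get(root, []):
        visit(d)
    return A, beta, gamma

two_leaf = {1: 0, 2: 1, 3: 1}
half = {k: [0.5, 0.5] for k in two_leaf}
A, beta, gamma = forward_map(two_leaf, half, i_max=1)
```

On this two-leaf tree with every link delayed by 0 or 1 with equal probability, the values agree with (5): for instance `gamma[1][0]` equals $A_1(0)\{1 - (1 - \beta_2(0))(1 - \beta_3(0))\} = 0.5 \times 0.75$.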

3.2.2 Relating $\alpha$ to $\gamma$

It remains to show that the mapping from $\mathcal{A}$ to $\mathcal{G}$ is injective. To this end, below we derive an
algorithm for inverting (5). We postpone to Appendix A the proof that the inverse is unique and
continuously differentiable. For the sake of clarity we separate the algorithm into two parts: in the first
we derive the cumulative delay distributions $A$ from $\gamma$; then we deconvolve $A$ to obtain $\alpha$.

Computing $A$

Step 0:

Solve (5) for $i = 0$. This amounts to solving the equation

$1 - \gamma_k(0)/A_k(0) = \prod_{d \in d(k)} (1 - \gamma_d(0)/A_k(0))$, $k \in U \setminus R$, (6)

and

$\gamma_k(0) = A_k(0)$, $k \in R$. (7)

This equation is formally identical to the one of the loss estimator [4]. From [4], we have that the
solution of (6) exists and is unique in $(0, 1)$ provided that $0 < \gamma_k(0) < \sum_{d \in d(k)} \gamma_d(0)$, which holds
for canonical delay trees. We then compute $\beta_k(0) = \gamma_k(0)/A_{f(k)}(0)$, $k \in U$.
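
Equation (6) can be solved numerically in several ways; the text does not prescribe a solver. The Python sketch below uses plain bisection, which is a choice made here for illustration, and it assumes that the root of interest is bracketed by $\gamma_k(0)$ and 1 (as it is when the inputs are exact probabilities of a canonical delay tree). The name `solve_step0` is hypothetical.

```python
def solve_step0(gamma_k, gamma_children, tol=1e-12):
    """Find A in (gamma_k, 1] with 1 - gamma_k/A = prod_d (1 - gamma_d/A)."""
    def g(a):
        prod = 1.0
        for gd in gamma_children:
            prod *= 1.0 - gd / a
        return (1.0 - gamma_k / a) - prod

    lo, hi = gamma_k, 1.0     # assumed bracket: g(lo) <= 0 <= g(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Two children: gamma_k(0) = 0.375 and gamma_d(0) = 0.25 for both children.
a_two = solve_step0(0.375, [0.25, 0.25])
# Three children: gamma_k(0) = 0.7 and gamma_d(0) = 0.4 for each child.
a_three = solve_step0(0.7, [0.4, 0.4, 0.4])
```

For two children the equation reduces to $A_k(0) = \gamma_{d_1}(0)\gamma_{d_2}(0) / (\gamma_{d_1}(0) + \gamma_{d_2}(0) - \gamma_k(0))$, which gives 0.5 for the first example and can be used to check the numerical result.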

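Step $i$:

Given $A_k(j)$ and $\beta_k(j)$, $k \in U$, $j = 0, \ldots, i-1$, in this step we compute $A_k(i)$ and $\beta_k(i)$,
$k \in U$. For $k \in U \setminus R$, in expression (5) we replace $\beta_d(i)$ with

$\dfrac{\gamma_d(i) - \sum_{j=1}^{i-1} \beta_d(i-j) A_k(j) - \beta_d(0) A_k(i)}{A_k(0)}$

(from (4)) and obtain the following equation

$\gamma_k(i) + A_k(0) \left\{ \prod_{d \in d(k)} \left[ 1 - \dfrac{\gamma_d(i) - \sum_{j=1}^{i-1} \beta_d(i-j) A_k(j) - \beta_d(0) \mathbf{A_k(i)}}{A_k(0)} \right] - 1 \right\}$
$\quad + \sum_{j=1}^{i-1} A_k(j) \left\{ \prod_{d \in d(k)} [1 - \beta_d(i-j)] - 1 \right\} + \mathbf{A_k(i)} \left\{ \prod_{d \in d(k)} [1 - \beta_d(0)] - 1 \right\} = 0$ (8)

(the unknown term $A_k(i)$ is highlighted in boldface). This is a polynomial in $A_k(i)$ of degree
$\#d(k)$. As shown in Appendix A, we consider the second largest solution of (8).

For $k \in R$, we directly compute $A_k(i)$ from (5), $A_k(i) = \gamma_k(i) - \sum_{j=0}^{i-1} A_k(j)$. Then we
compute $\beta_k(i)$, $k \in U$, as $\beta_k(i) = \dfrac{\gamma_k(i) - \sum_{j=1}^{i} A_{f(k)}(j) \beta_k(i-j)}{A_{f(k)}(0)}$.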

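Computing $\alpha$

Once step $i_{\max}$ is completed, we compute $\alpha_k(i)$, $k \in U$, as follows:

$\alpha_k(i) = \begin{cases} \dfrac{A_k(0)}{A_{f(k)}(0)}, & i = 0, \\[1ex] \dfrac{A_k(i) - \sum_{j=1}^{i} A_{f(k)}(j)\,\alpha_k(i-j)}{A_{f(k)}(0)}, & 1 \le i \le i_{\max}. \end{cases}$ (9)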

3.3  Example:  the  Two-leaf  Tree 

In this section we illustrate the application of the results of Section 3.1 to the two-leaf tree of
Figure 1. We assume that on each link, a probe either suffers no delay, a unit amount of delay, or
is otherwise lost; for $k \in \{1, 2, 3\}$, therefore, delay takes values in $\{0, 1, \infty\}$.

Figure 1: Two-leaf multicast Tree.

Figure 2: Four-leaf multicast Tree.

For this example, equations (6) and (8) can be solved explicitly; combined with (9) we obtain
the estimates
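
As a partial worked illustration, the zero-delay ($i = 0$) quantities for the two-leaf tree follow directly from (6), (7) and (9). The derivation below assumes the labelling in which node 1 is the single child of the root and nodes 2 and 3 are the two receivers; it covers only the $i = 0$ step, and the same formulas apply with $\gamma$ replaced by its empirical estimate $\hat\gamma$.

```latex
% Equation (6) at k = 1, with d(1) = {2, 3}:
1 - \frac{\gamma_1(0)}{A_1(0)}
  = \Bigl(1 - \frac{\gamma_2(0)}{A_1(0)}\Bigr)\Bigl(1 - \frac{\gamma_3(0)}{A_1(0)}\Bigr)
\;\Longrightarrow\;
A_1(0) = \frac{\gamma_2(0)\,\gamma_3(0)}{\gamma_2(0) + \gamma_3(0) - \gamma_1(0)}.

% Equation (7) at the leaves: A_2(0) = \gamma_2(0), A_3(0) = \gamma_3(0).
% Equation (9) with A_0(0) = 1:
\alpha_1(0) = A_1(0), \qquad
\alpha_2(0) = \frac{A_2(0)}{A_1(0)} = \frac{\gamma_2(0) + \gamma_3(0) - \gamma_1(0)}{\gamma_3(0)}, \qquad
\alpha_3(0) = \frac{A_3(0)}{A_1(0)} = \frac{\gamma_2(0) + \gamma_3(0) - \gamma_1(0)}{\gamma_2(0)}.
```

The first implication is obtained by expanding the product and multiplying both sides by $A_1(0)^2$, which leaves a linear equation in $A_1(0)$.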


3.4 Rate of Convergence of the Delay Distribution Estimator

3.4.1  Asymptotic  Behavior  of  the  Delay  Distribution  Estimator 

In this section, we study the rate of convergence of the estimator. Theorem 2 states that $\hat\alpha$
converges to $\alpha$ with probability 1 as $n$ grows to infinity, but it provides no information on the rate of
convergence. Because of the mild conditions satisfied by $\Gamma^{-1}$, we can use the Central Limit Theorem
to establish the following asymptotic result.

Theorem 3 When $\gamma \in \Gamma(\mathcal{A}^{(1)})$, as $n \to \infty$, $\sqrt{n}(\hat\alpha - \alpha)$ converges in distribution to a multivariate
normal random variable with mean vector 0 and covariance matrix $\nu = D(\alpha) \cdot \sigma \cdot D^T(\alpha)$, where
$\sigma_{(k_1,i)(k_2,j)} = \lim_{n \to \infty} n\,\mathrm{Cov}(\hat\gamma_{k_1}(i), \hat\gamma_{k_2}(j))$, $k_1, k_2 \in U$, $i, j \in \{0, \ldots, i_{\max}\}$,
$D_{(k_1,i)(k_2,j)}(\alpha) = \frac{\partial \Gamma^{-1}_{k_1,i}}{\partial \gamma_{k_2}(j)}(\Gamma(\alpha))$ and $T$ denotes the transpose.

Proof: By the Central Limit Theorem, it follows that the random variables $\hat\gamma$ are asymptotically
Gaussian as $n \to \infty$ with

$\sqrt{n}(\hat\gamma - \gamma) \xrightarrow{\mathcal{D}} N(0, \sigma)$.

Here $\xrightarrow{\mathcal{D}}$ denotes convergence in distribution. Following the same lines as the proof of Theorem
2, when $\gamma \in \Gamma(\mathcal{A}^{(1)})$ there exists $n_0$ such that $\hat\gamma \in \Gamma(\mathcal{A}^{(1)})$ for $n > n_0$. Then, since $\Gamma^{-1}$ is
continuously differentiable on $\mathcal{G}$, the Delta method (see Chapter 7 of [34]) yields that $\hat\alpha = \Gamma^{-1}(\hat\gamma)$
is also asymptotically Gaussian as $n \to \infty$:

$\sqrt{n}(\Gamma^{-1}(\hat\gamma) - \alpha) \xrightarrow{\mathcal{D}} N(0, \nu)$.

Theorem  3  allows  us  to  compute  confidence  intervals  of  the  estimates,  and  therefore  their 
accuracy  and  their  convergence  rate  to  the  true  values  as  n  grows.  This  is  relevant  in  assessing: 
(i)  the  number  of  probes  required  to  obtain  a  desired  level  of  accuracy  of  the  estimate;  (ii)  the 
likely  accuracy  of  the  estimator  from  actual  measurements  by  associating  confidence  intervals  to 
the  estimates. 

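For large $n$, the estimator $\hat\alpha_k(i)$ will lie in the interval

$\alpha_k(i) \pm z_{\delta/2} \sqrt{\dfrac{\nu_{(k,i)(k,i)}}{n}}$ (10)

where $z_{\delta/2}$ is the $1 - \delta/2$ quantile of the standard normal distribution and the interval estimate is a
$100(1 - \delta)\%$ confidence interval.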

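To obtain the confidence interval for $\alpha$ derived from measured data from $n$ probes, we estimate
$\nu$ by $\hat\nu = D(\hat\alpha) \cdot \hat\sigma \cdot D^T(\hat\alpha)$ where

$\hat\sigma_{(k_1,i)(k_2,j)} = \dfrac{1}{n-1} \left( \sum_{l=1}^{n} \mathbf{1}\{\hat Y_{k_1,l} \le iq \wedge \hat Y_{k_2,l} \le jq\} - \dfrac{1}{n} \sum_{l=1}^{n} \mathbf{1}\{\hat Y_{k_1,l} \le iq\} \sum_{l=1}^{n} \mathbf{1}\{\hat Y_{k_2,l} \le jq\} \right)$

and $D(\hat\alpha)$ is the Jacobian of the inverse map $\Gamma^{-1}$ computed for $\alpha = \hat\alpha$. We then use confidence
intervals of the form

$\hat\alpha_k(i) \pm z_{\delta/2} \sqrt{\dfrac{\hat\nu_{(k,i)(k,i)}}{n}}$. (11)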

3.4.2  Dependence  of  the  Delay  Distribution  Estimator  on  Topology 

Figure 3: Asymptotic Estimator Variance and Tree Depth. Binary tree with depth 2, 3
and 4. Left: Minimum and Maximum Variance of the estimates $\hat\alpha_k(0)$ (a) and $\hat\alpha_k(1)$ (b) over all
links.

The estimator variance determines the number of probes required to obtain a given level of
accuracy. Therefore, it is important to understand how the variance is affected by the underlying
parameters, namely the delay distributions and the multicast tree topology. The following
Theorem, the proof of which we postpone to Appendix C, characterizes the behavior of the variance for
small delays. Set $\|\alpha\| = \max_{k \in U,\, i > 0} \alpha_k(i)$.

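Theorem 4 As $\|\alpha\| \to 0$,

$\nu = \begin{pmatrix} \nu_{k_1 k_1} & 0 & \cdots & 0 \\ 0 & \nu_{k_2 k_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \nu_{k_{\#U} k_{\#U}} \end{pmatrix} + O(\|\alpha\|^2)$ with $\nu_{kk} = \begin{pmatrix} \sum_{i=1}^{i_{\max}} \alpha_k(i) & \alpha_k(1) & \alpha_k(2) & \cdots & \alpha_k(i_{\max}) \\ \alpha_k(1) & \alpha_k(1) & 0 & \cdots & 0 \\ \alpha_k(2) & 0 & \alpha_k(2) & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ \alpha_k(i_{\max}) & 0 & 0 & \cdots & \alpha_k(i_{\max}) \end{pmatrix}$. (12)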

Theorem 4 states that the estimator variance is, to first order, independent of the topology. To
explore higher order dependencies, we computed the asymptotic variance for a selection of trees
with different depths and branching ratios. We use the notation $T(r_1, \ldots, r_m)$ to denote a tree of
$m + 1$ levels where, apart from node 0 that has one descendant, nodes at level $j$ have exactly
$r_j$ children. For simplicity, we consider the case when link delay takes values in $\{0, 1\}$, i.e., we
consider no loss, and study the behavior as a function of $\alpha_k(1) = \alpha$.

In Figure 3 we show the dependence on tree depth for binary trees of depth 2, 3 and 4. We plot
the maximum value of the variance over the links, $\max_k \mathrm{Var}(\hat\alpha_k(0))$ (a) and $\max_k \mathrm{Var}(\hat\alpha_k(1))$ (b).
In these examples, the variance increases with the tree depth. In Figure 4 we show the dependence
on branching ratio for a tree of level 2. We plot the estimator variance for both link 1 (the common
link) and link 2 (a generic receiver). In these examples, increasing the branching ratio decreases
the variances, especially those of the common link estimates, which increase less than linearly for
$\alpha$ up to 0.7 when the branching ratio is larger than 3. In all cases, the variance of $\hat\alpha_k(1)$ is larger
than that of $\hat\alpha_k(0)$.

Figure 4: Asymptotic Estimator Variance and Branching Ratio. Binary tree with
depth 2 and 2, 4 or 6 receivers. Left: Variance of $\hat\alpha_k(0)$ (a) and $\hat\alpha_k(1)$ (b) for link 1 (common link)
and 2 (generic receiver).

In all cases, as predicted by Theorem 4, the estimator variance is asymptotically linear in $\alpha$
independently of the topology as $\alpha \to 0$. As $\alpha$ increases, the behavior is affected by different
factors: increasing the branching ratio results in a reduction of the variance, while increasing the tree
depth results in an increase of the variance. The first can be explained in terms of the increased number of
measurements available for the estimation as the number of receivers sharing a given link increases;
the second appears to be the effect of cumulative errors that accrue as the number of links along a
path increases ($\hat\alpha$ is computed iteratively on the tree). We also observe that the variance increases
with the delay lag; this appears to be caused by the iterative computation over the bins,
which progressively accumulates errors.


4  Computation  of  the  Delay  Distribution  Estimator 

In this section we describe an algorithm for computing the delay distribution estimate from
measurements based on the results presented in the previous section. We also discuss its suitability for
distributed implementation and how to adapt the computation to handle different bin sizes.

We assume the experimental data of source-to-leaf delays $(Y_{k,i})_{k \in R,\, i = 1, \ldots, n}$ from $n$ probes, as
collected at the leaf nodes $k \in R$. Two steps must be initially performed to render the data into a
form suitable for the inference algorithms: (i) removal of fixed delays and (ii) choosing a bin size
$q$ and computing the estimate $\hat\gamma$.

The first step is necessary since it is generally not possible to apportion the deterministic
component of the source-to-leaf delays between interior links. (To see this, it is sufficient to consider
the case of the two receiver tree; expressing the link fixed delays in terms of the source-to-leaf
fixed delays results in two equations in three unknowns.) Thus we normalize each measurement
by subtracting the minimum delay seen at the leaf. (Observe that interpreting the observed minimum
delay as the transmission delay assumes that at least one probe has experienced no queuing delay
along the path.)

The second step is to choose the bin size $q$ and discretize the delay measurements accordingly.
This introduces a quantization error which affects the accuracy of the estimates. As our results have
shown, the accuracy increases as $q$ decreases (we have obtained accurate results over a significant
range of values of $q$ up to the same order of magnitude as the links' average delay). The choice of
q  represents  a  trade-off  between  accuracy  and  cost  of  the  computation  as  a  smaller  bin  size  entails 
a  higher  computational  cost  due  to  the  higher  dimensionality  of  the  binned  distributions. 

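These two steps are carried out as follows. From the measured data $(Y_k)_{k \in R}$, we recursively
construct the auxiliary vector process $\hat Y = (\hat Y_k)_{k \in V}$:

$$\hat Y_{k,l} = Y_{k,l} - \min_{m \in \{1,\dots,n\}} Y_{k,m}, \quad k \in R, \qquad (13)$$

$$\hat Y_{k,l} = \min_{j \in d(k)} \hat Y_{j,l}, \quad k \in V \setminus R. \qquad (14)$$

The binned estimates of $\gamma$ are

$$\hat\gamma_k(i) = \frac{1}{n} \sum_{m=1}^{n} \mathbf{1}\{\hat Y_{k,m} \le iq + q/2\}, \quad i = 0, \dots, i_{\max}, \qquad (15)$$

with

$$i_{\max} = \left\lceil \frac{\max_{k \in R,\, m \in M_k(n)} \hat Y_{k,m}}{q} \right\rceil .$$

Here $\lceil x \rceil$ denotes the smallest integer greater than $x$ and
$M_k(n) = \{m \in \{1, \dots, n\} \mid \hat Y_{k,m} < \infty\}$.
Observe that $i_{\max}$ represents the largest value at which the estimate is nonzero.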
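
The normalization and binning steps (13)-(15) can be illustrated with a minimal Python sketch. The
function names, the two-receiver data and the use of the standard ceiling for $i_{\max}$ are
assumptions made for illustration only; lost probes are represented by an infinite delay.

```python
import math

INF = math.inf

def normalize(delays):
    """Subtract the minimum finite delay seen at a leaf (eq. 13); losses stay infinite."""
    base = min(y for y in delays if y < INF)
    return [y - base if y < INF else INF for y in delays]

def aggregate(children_series):
    """Per-probe minimum over the child nodes (eq. 14)."""
    return [min(values) for values in zip(*children_series)]

def binned_gamma(y_hat, q, i_max):
    """Fraction of probes whose normalized delay is at most i*q + q/2 (eq. 15)."""
    n = len(y_hat)
    return [sum(1 for y in y_hat if y <= i * q + q / 2) / n for i in range(i_max + 1)]

# Hypothetical end-to-end delays (ms) at two receivers sharing one parent node.
leaf_a = normalize([10.0, 11.0, 10.0, INF, 12.0])
leaf_b = normalize([20.0, 20.0, 21.0, 22.0, INF])
parent = aggregate([leaf_a, leaf_b])

q = 1.0  # bin size (ms)
largest = max(y for series in (leaf_a, leaf_b) for y in series if y < INF)
i_max = math.ceil(largest / q)  # assumption: standard ceiling
gamma_parent = binned_gamma(parent, q, i_max)
```

In this toy data set every probe reaches at least one receiver, so the last entry of
`gamma_parent` equals 1; a probe lost on both paths would keep it below 1.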

The estimate can be computed iteratively over the delay lag and recursively over the tree.
The pseudocode for carrying out the computation is found in Figure 5. The procedure find_y
calculates $\hat Y_k$ and $\hat\gamma_k$, with $\hat Y_{k,l}$ initialized to
$Y_{k,l} - \min_m Y_{k,m}$ for $k \in R$ and to $\infty$ (a value larger than any observed delay
suffices) otherwise. The procedure infer_delay calculates $\hat\alpha_k(i)$ for a fixed $i$
recursively on the tree, with $\hat A_k[i]$, $k \in V$, $i = 0, \dots, i_{\max}$, initialized to 0,
except for $\hat A_0[0]$, which is set to 1. The output of the algorithm is the set of estimates
$\hat\alpha_k$, $k \in U$.

procedure main {
    find_y ( 1 ) ;
    foreach ( $i \in \{0, \dots, i_{\max}\}$ )
        infer_delay ( 1 , i ) ;
}

procedure find_y ( k ) {
    foreach ( $j \in d(k)$ ) {
        $\hat Y_j$ = find_y ( j ) ;
        foreach ( $m \in \{1, \dots, n\}$ )
            $\hat Y_k[m] = \min\{ \hat Y_k[m], \hat Y_j[m] \}$ ;
    }
    foreach ( $i \in \{0, \dots, i_{\max}\}$ )
        $\hat\gamma_k[i] = n^{-1} \sum_{j=1}^{n} \mathbf{1}\{\hat Y_k[j] \le iq + q/2\}$ ;
    return $\hat Y_k$ ;
}

procedure infer_delay ( k, i ) {
    if ( i == 0 ) {
        $\hat A_k[i]$ = solvefor1 ( $\hat A_k[i]$ ,
            $(1 - \hat\gamma_k[i]/\hat A_k[i]) == \prod_{d \in d(k)} (1 - \hat\gamma_d[i]/\hat A_k[i])$ ) ;
    } else {
        $\hat A_k[i]$ = solvefor2 ( $\hat A_k[i]$ ,
            $\hat\gamma_k[i] + \hat A_k[0] \bigl\{ \prod_{d \in d(k)} \bigl[ 1 - \bigl( \hat\gamma_d[i] - \sum_{j=1}^{i} \hat\beta_d[i-j] \hat A_k[j] \bigr) / \hat A_k[0] \bigr] - 1 \bigr\}$
            $+ \sum_{j=1}^{i} \hat A_k[j] \bigl\{ \prod_{d \in d(k)} [ 1 - \hat\beta_d[i-j] ] - 1 \bigr\} == 0$ ) ;
    }
    $\hat\beta_k[i] = \bigl( \hat\gamma_k[i] - \sum_{j=1}^{i} \hat A_{f(k)}[j] \hat\beta_k[i-j] \bigr) / \hat A_{f(k)}[0]$ ;
    $\hat\alpha_k[i] = \bigl( \hat A_k[i] - \sum_{j=1}^{i} \hat A_{f(k)}[j] \hat\alpha_k[i-j] \bigr) / \hat A_{f(k)}[0]$ ;
    foreach ( $j \in d(k)$ )
        infer_delay ( j , i ) ;
}

Figure 5: Pseudocode for Inference of Delay Distribution.

Within the code, an empty product (which occurs when the first argument of infer_delay is a leaf)
is assumed to be zero. The routines solvefor1 and solvefor2 return the value of the first
symbolic argument that solves the equation in the second argument. The routine solvefor1 returns a
solution in $(0, 1)$; from Lemma 1 in [4] this is known to be unique. The routine solvefor2 returns
the unique solution if the second argument is linear in $\hat A_k(i)$ (this happens only if $k$ is a
leaf node); otherwise it returns the second largest solution.

4.1  Distributed Implementation

As with the loss estimator [4], the algorithm is recursive on trees. In particular, observe that the
computation of $\hat\gamma_k$ and $\hat A_k$ only requires the knowledge of
$(\hat Y_{j,m})_{j \in d(k),\, m = 1, \dots, n}$; these are computed recursively on the tree starting
from the receivers. Therefore it is possible to distribute the computation among the nodes of the
tree (or representative nodes of subtrees), with each node $k$ being responsible for the aggregation
of the measurements of its child nodes through (14) and for the computation of $\hat\gamma_k$ and
$\hat A_k$.
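
As an illustration of the per-node computation, the zero-lag equation handed to solvefor1 in
Figure 5, $1 - \hat\gamma_k/A = \prod_{d \in d(k)} (1 - \hat\gamma_d/A)$, can be solved numerically.
The sketch below uses plain bisection on the bracket $[\hat\gamma_k, 1]$; the bracket, the function
name and the numerical values are assumptions for illustration, not the routine used by the authors.

```python
from math import prod

def solve_for_1(gamma_k, gamma_children, tol=1e-12):
    """Bisection for A in [gamma_k, 1] with 1 - gamma_k/A == prod(1 - gamma_d/A).

    Intended for interior nodes only (at least two children).
    """
    def g(a):
        return (1.0 - gamma_k / a) - prod(1.0 - gd / a for gd in gamma_children)

    lo, hi = gamma_k, 1.0
    if not (g(lo) < 0.0 < g(hi)):
        raise ValueError("no sign change on the bracket; data violate the assumptions")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical node with two children: the probe reaches the node with zero delay
# with probability 0.8 and then reaches each subtree with zero delay w.p. 0.9 and 0.7.
a_true, b2, b3 = 0.8, 0.9, 0.7
gamma_2 = a_true * b2
gamma_3 = a_true * b3
gamma_1 = a_true * (1.0 - (1.0 - b2) * (1.0 - b3))
a_hat = solve_for_1(gamma_1, [gamma_2, gamma_3])
```

With exact (noise-free) values of $\gamma$ the solver recovers the assumed value 0.8; with measured
frequencies the bracket check can fail for small samples, which is why it raises instead of guessing.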

4.2  Adopting  Different  Bin  Sizes 

Following  the  results  of  the  previous  section,  we  presented  the  algorithm  using  a  fixed  value  of  q 
for  all  links.  This  can  be  quite  restrictive  in  a  heterogeneous  environment,  where  links  may  differ 
significantly  in  terms  of  speed  and  buffer  sizes;  a  single  value  of  q  could  be  at  the  same  time  too 
coarse  grained  for  describing  the  delay  of  a  high  bandwidth  link  but  too  fine-grained  to  efficiently 
capture  the  essential  characteristics  of  the  delay  experienced  along  a  low  bandwidth  link. 

A simple way to overcome this limitation is to run the algorithm for different values of $q$, each
best suited to the behavior of a different group of links, and to retain each time only the
solutions for those links. A drawback of this approach is that each distribution is computed for all
the different bin sizes. The distributed nature of the algorithm suggests we can do better; indeed,
since the $A_k$, $k \in U$, can be computed independently of one another, it is possible to compute
each link delay distribution only for the bin size best suited to its delay characteristics. More
precisely, let $q_k$ denote the bin size adopted for link $k$. In order to compute $\alpha_k$ with
bin size $q_k$ we need to compute both $A_k$ and $A_{f(k)}$ with bin size $q_k$. Thus, the overall
computation requires calculating each cumulative distribution $A_k$ only for the bin sizes $q_j$,
$j \in d(k) \cup \{k\}$, i.e., only for the bin sizes adopted for the links terminating at node $k$
and at all its child nodes, rather than for the bin sizes adopted for all links.
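
The bookkeeping implied by this rule is small. A sketch, with a hypothetical tree and hypothetical
per-link bin sizes, of which bin sizes each $A_k$ has to be computed for:

```python
def bin_sizes_needed(children, q):
    """For each node k, collect the bin sizes q_j with j in d(k) U {k}."""
    return {k: {q[k]} | {q[j] for j in children.get(k, [])} for k in q}

# Hypothetical tree: node 1 has children 2 and 3; node 2 has children 4 and 5.
children = {1: [2, 3], 2: [4, 5]}
# Hypothetical bin size (ms) chosen for the link terminating at each node.
q = {1: 0.1, 2: 0.1, 3: 1.0, 4: 1.0, 5: 2.0}

needed = bin_sizes_needed(children, q)
```

Here node 2 needs three resolutions (its own and those of links 4 and 5), while every leaf needs
only its own, instead of all three distinct values being used everywhere.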

In an implementation, we envision that a fixed value for all links is used first. This can be
chosen based on the spread of the measurements and the tree topology, or on past delay history.
Then, with a better idea of each link's delay spread, it would be possible to refine the value of
the bin size on a link-by-link basis.

5  Experimental  Evaluation 

We evaluated our delay estimator through extensive simulation. Our first set of experiments focuses
on the statistical properties of the estimator. We perform model simulation, where delay and loss
are determined by random processes that follow the model on which we based our analysis. In
our second set of experiments we investigate the behavior of the estimators in a more realistic
setting where the model assumption of independence may be violated. To this end, we perform
TCP/UDP simulation, using the ns simulator. Here delay and loss are determined by queuing
delay and queue overflows at network nodes as multicast probes compete with traffic generated by
TCP/UDP sources.

5.1  Comparing  Inferred  vs.  Sample  Distributions 

Before examining the results of our experiments, we describe our approach to assessing the
accuracy of the inferred distributions. Given an experiment in which $n$ probes are sent from the
source to the receivers, for $k \in U$, the inferred distribution $\hat\alpha_k$ ($\hat A_k$) is
computed from the end-to-end measurements using the algorithm described in Section 4. Its accuracy
must be measured against the actual data, represented by a finite sequence of delays
$\{D_{k,m}\}_{m=1}^{n}$ ($\{Y_{k,m}\}_{m=1}^{n}$) experienced by the probes in traversing (reaching)
that link. For simplicity of notation we assume, hereafter, that each set of data has already been
normalized by subtracting the minimum delay from the sequence.

We compare summary statistics of link delay, namely the mean and the variance. A finer
evaluation of the accuracy lies in a direct comparison of the inferred and sample distributions. To
this end, we also compute the largest absolute deviation between the inferred and sample c.d.f.s.
This measure is used in statistics in the Kolmogorov-Smirnov test for goodness of fit of a
theoretical distribution to a sample distribution. A small value for this measure indicates that
the theoretical distribution provides a good fit to the sample distribution; a large value leads to
the rejection of the hypothesis. We cannot directly apply the test as we deal with an inferred
rather than a sample c.d.f.; however, we will use the largest absolute deviation as a global measure
of accuracy of the inferred distributions.

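We compute the sample distributions $\bar\alpha$ and $\bar A$ using the same bin size $q$ as the
estimator (denoted $d$ in what follows). More precisely, we compute $(\bar\alpha_k)_{k \in U}$ and
$(\bar A_k)_{k \in V}$ as

$$\bar\alpha_k(i) = \frac{1}{\# N_k(n)} \sum_{j \in N_k(n)} \mathbf{1}\{D_{k,j} \in (id - d/2,\, id + d/2]\}, \quad i = 0, \dots, i_{\max},$$

with $\bar\alpha_k(\infty) = \frac{1}{\# N_k(n)} \sum_{j \in N_k(n)} \mathbf{1}\{D_{k,j} = \infty\}$, and

$$\bar A_k(i) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{Y_{k,j} \in (id - d/2,\, id + d/2]\}, \quad i = 0, \dots, i_{\max},$$

with $\bar A_k(\infty) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{Y_{k,j} = \infty\}$. (Observe that in
computing $(\bar\alpha_k)_{k \in U}$, the sum is carried out only over
$N_k(n) = \{j \in \{1, \dots, n\} \mid Y_{f(k),j} < \infty\}$, the set of probes over which the delay
along link $k$ is defined, either finite or infinite.)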

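The largest absolute deviation between the inferred and sample c.d.f.s is, then,

$$\Delta_k = \max_{i = 0, \dots, i_{\max}} \Bigl| \sum_{j=0}^{i} \hat\alpha_k(j) - \sum_{j=0}^{i} \bar\alpha_k(j) \Bigr| ,$$

where $\hat\alpha_k$ is the inferred and $\bar\alpha_k$ the sample distribution. In other words,
$\Delta_k$ is the smallest nonnegative number such that $\sum_{j \le i} \hat\alpha_k(j)$ lies between
$\sum_{j \le i} \bar\alpha_k(j) \pm \Delta_k$ for all $i = 0, \dots, i_{\max}$. The same result
holds for the tail probabilities.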
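
A short sketch of this accuracy measure, assuming both distributions are given as lists of bin
probabilities over the same bins (the function name and the numbers are illustrative only):

```python
from itertools import accumulate

def largest_abs_deviation(inferred_pmf, sample_pmf):
    """Largest absolute difference between the two cumulative distributions."""
    if len(inferred_pmf) != len(sample_pmf):
        raise ValueError("both distributions must use the same bins")
    inferred_cdf = accumulate(inferred_pmf)
    sample_cdf = accumulate(sample_pmf)
    return max(abs(a - b) for a, b in zip(inferred_cdf, sample_cdf))

# Hypothetical three-bin distributions.
inferred = [0.75, 0.25, 0.0]
sample = [0.5, 0.25, 0.25]
delta = largest_abs_deviation(inferred, sample)
```

Because the measure is taken on the c.d.f.s, probability mass that is merely shifted to a
neighbouring bin contributes once, rather than twice as it would in a bin-by-bin comparison.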


Figure 6: Two-receiver tree. (a): Simulation topology. (b): Convergence of $\hat\alpha_k(1)$ to
$\alpha_k(1)$. $\alpha_k(0) = 0.79$, $\alpha_k(1) = 0.2$ and $\alpha_k(\infty) = 0.01$ for all links.

Figure 7: Agreement between simulated and theoretical confidence intervals.
(a): Results from 100 model simulations. (b): Prediction from (10). The graphs show two-sided
confidence intervals at 2 standard deviations for links 1 and 2. Parameters are $\alpha_k(1) = 0.2$
and $\alpha_k(\infty) = 0.01$ for all links.

5.2  Model Simulation

We first consider the two-leaf topology of Figure 6(a), with source 0 and receivers 2 and 3. Link
delays are independent, taking values in $\{0, 1, \infty\}$; if a probe is not lost it experiences
either no delay or unit delay. In Figure 6(b) we plot the estimate $\hat\alpha_k(1)$ versus the
model values for a run comprising 10000 probes. The estimate converges to within 2% of the model
value within 4000 probes. In Figure 7 we compare the empirical and theoretical 95% confidence
intervals. The theoretical intervals are computed from (10). The empirical intervals are computed
over 100 independent simulations. The agreement between simulation and theory is close: the two sets
of curves are almost indistinguishable.
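
A model simulation of this kind is easy to reproduce in miniature. The sketch below draws
independent link delays in $\{0, 1, \infty\}$ on the two-leaf tree and estimates only the zero-delay
probability of link 1. For two children the zero-lag equation
$1 - \gamma_1/A = (1 - \gamma_2/A)(1 - \gamma_3/A)$ reduces to
$A = \gamma_2\gamma_3/(\gamma_2 + \gamma_3 - \gamma_1)$, which is what the code uses. The seed, the
function names and the restriction to lag 0 are assumptions for illustration; the normalization
step is skipped because the minimum delay in the model is 0.

```python
import random

def draw_delay(rng, p0=0.79, p1=0.2):
    """Link delay in {0, 1, inf} with probabilities p0, p1 and 1 - p0 - p1."""
    u = rng.random()
    if u < p0:
        return 0.0
    if u < p0 + p1:
        return 1.0
    return float("inf")

def estimate_link1_zero_delay(n, seed=1):
    """Estimate P[D_1 = 0] on the tree 0 -> 1 -> {2, 3} from receiver delays only."""
    rng = random.Random(seed)
    hits_1 = hits_2 = hits_3 = 0
    for _ in range(n):
        d1, d2, d3 = (draw_delay(rng) for _ in range(3))
        y2, y3 = d1 + d2, d1 + d3          # end-to-end delays seen at the receivers
        hits_2 += y2 == 0.0
        hits_3 += y3 == 0.0
        hits_1 += min(y2, y3) == 0.0
    g1, g2, g3 = hits_1 / n, hits_2 / n, hits_3 / n
    # Closed form of 1 - g1/A = (1 - g2/A)(1 - g3/A) for a node with two children.
    return g2 * g3 / (g2 + g3 - g1)

estimate = estimate_link1_zero_delay(10000)
```

The estimate should land close to the model value 0.79, using nothing but the delays observed at
receivers 2 and 3.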

Next we consider the topology of Figure 8. Delays are independently distributed according
to a truncated geometric distribution taking values in $\{0, 1, \dots, 40, \infty\}$ (in ms). This
topology is also used in subsequent TCP/UDP simulations, and the link average delay and loss
probability are chosen to match the values obtained from these. The average delay ranges between 1
and 2ms for the slower edge links and between 0.2 and 0.5ms for the faster interior links; the link
losses range from 1% to 11%. In Figure 9 we plot the estimated average link delay and standard
deviation with the empirical 95% confidence interval computed over 100 simulations. The results are
very accurate even for several hundred probes: the theoretical average delay always lies within the
confidence interval and the standard deviation does so for 1500 or more probes.

To compare the inferred and sample distributions, we computed the largest absolute deviation
between the inferred and sample c.d.f.s. The results are summarized in Figure 10, where we plot
the minimum, median and maximum largest absolute deviation in 100 simulations, computed
over all links as a function of $n$ (a) and link by link for $n = 10000$ (b). The accuracy increases
with the number of probes, with a spread of two orders of magnitude between the minimum
and maximum. For more than 3000 probes, the average largest deviation over all links is less
than 1%. The accuracy varies from link to link: when the number of probes is $n = 10000$,
at one extreme we have link 4 with $0.18\% < \Delta_4 < 0.8\%$ and at the other extreme link 6 with
$0.3\% < \Delta_6 < 4\%$ over 100 simulations. We observe that the inferred distributions are less
accurate as we go down the tree. This is in agreement with the results of Section 3.4 and is
explained by the larger variance of the inferred probabilities at downstream nodes compared with
upstream nodes.

5.3  TCP/UDP  Simulations 

We used the topology shown in Figure 8. To capture the heterogeneity between the edges and the core
of a WAN, interior links have higher capacity (5Mb/sec) and propagation delay (50ms) than those at
the edge (1Mb/sec and 10ms). Each link is modeled as a FIFO queue with a 4-packet capacity.

Figure 8: Simulation Topology. Links are of two types: edge links of 1Mb/s capacity and 10ms
latency, and interior links of 5Mb/s capacity and 50ms latency.

Figure 9: Model Simulation: Topology of Figure 8. Estimated versus theoretical delay average and
standard deviation with 95% confidence interval computed over 100 model simulations.

Figure 10: Model Simulation: Topology of Figure 8. Accuracy of the estimated distribution. Largest
vertical absolute deviation between estimated and sample c.d.f. Minimum, median and maximum largest
absolute deviation in 100 simulations, computed over all links as a function of $n$ (a) and link by
link for $n = 10000$ (b).


Node 0 generates probes as a 20Kbit/s stream comprising 40-byte UDP packets according to
a Poisson process with a mean interarrival time of 16ms; this represents 2% of the smallest link
capacity. Observe that even for this simple topology with 8 end-points, a mesh of unicast
measurements with the same traffic characteristics would require an aggregate bandwidth of 160Kbit/s
at the root. The background traffic comprises a mix of infinite data source TCP connections (FTP)
and exponential on-off sources using UDP. Averaged over the different simulations, the link loss
ranges between 1% and 11% and link utilization ranges between 20% and 60%.
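
The probe-load figures quoted above follow from simple arithmetic, reproduced here as a check. The
constant names are illustrative, and 1Mb/sec is taken to mean $10^6$ bit/s, which is an assumption
about the units.

```python
PACKET_BYTES = 40              # size of one UDP probe
MEAN_INTERARRIVAL_S = 0.016    # mean spacing of the Poisson probe stream
SMALLEST_LINK_BPS = 1_000_000  # edge link capacity, assuming 1Mb/sec = 10**6 bit/s
RECEIVERS = 8                  # end-points of the topology in Figure 8

probe_rate_bps = PACKET_BYTES * 8 / MEAN_INTERARRIVAL_S
share_of_edge_link = probe_rate_bps / SMALLEST_LINK_BPS
unicast_mesh_bps = RECEIVERS * probe_rate_bps

print(f"probe stream: {probe_rate_bps / 1000:.0f} Kbit/s")
print(f"share of slowest link: {share_of_edge_link:.0%}")
print(f"unicast mesh at the root: {unicast_mesh_bps / 1000:.0f} Kbit/s")
```

The factor of 8 between the multicast stream and the unicast mesh is simply the number of
receivers, since each unicast path would need its own copy of the probe stream at the root.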

For a single experiment, Figure 11 compares the estimated versus the sample average delay
for representative selected links. The analysis has been carried out using $d = 1$ms (a) and
$d = 0.1$ms (b). In this example, we obtain practically the same accuracy despite a tenfold
difference in resolution. (Observe that $d = 1$ms is of the same order of magnitude as the average
delays.) The inferred averages rapidly converge to the sample averages even though we have
persistent systematic errors in the inferred values due to consistent spatial correlation. We shall
comment upon this later.

Figure 11: Convergence of inferred versus sample average link delay in TCP/UDP simulations.
(a): bin size $d = 1$ms. (b): bin size $d = 0.1$ms. The graphs show how the inferred values closely
track the sample average delays.

Figure 12: Dynamic accuracy of inference. Sample and inferred average delay on links 1, 3 and 10 of
the multicast tree in Figure 8. (a): 5-second window. (b): 20-second window. Background traffic has
random start and stop times.

Figure 13: Accuracy of inference: average delay. Left: $d = 1$ms. Right: $d = 0.1$ms. The graphs
show the normalized Root Mean Square error between the estimated and sample average delay over 100
simulations.

In order to show how the inferred values not only quickly converge, but also exhibit good dynamics
tracking, in Figure 12 we plot the inferred versus the sample average delay for 3 links (1, 3 and
10) computed over a moving window of two different sizes with jumps of half its width. To allow
greater dynamics, here we arranged background sources with random start and stop times. Under both
window sizes (approximately 300 and 1200 probes are used, respectively), the estimates of the
average delays of links 1 and 10 show good agreement and a quick response to delay variability,
revealing a good convergence rate of the estimator. For link 3, with a smaller average delay, the
behavior is rather poor, especially for the 5-second window size.
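
The moving window with jumps of half its width can be sketched as follows. Only the windowing of a
sample average is shown; in the experiment the estimator would be run on the probes falling in each
window. The function name and the toy data are assumptions for illustration.

```python
def windowed_means(times, values, width, t_end):
    """Mean of `values` over windows [start, start + width), advancing by width / 2."""
    means = []
    start = 0.0
    while start + width <= t_end:
        inside = [v for t, v in zip(times, values) if start <= t < start + width]
        means.append(sum(inside) / len(inside) if inside else None)
        start += width / 2
    return means

# Hypothetical probe arrival times (s) and delays (ms) with a rising trend.
times = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
values = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0]
means = windowed_means(times, values, width=4.0, t_end=8.0)
```

Consecutive windows share half of their probes, which smooths the curve while still letting the
average follow the trend; a shorter window reacts faster but averages over fewer probes.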

For  a  selection  of  links,  in  Figure  13  we  plot  the  Root  Mean  Square  (RMS)  normalized  error 
between  the  estimated  and  sample  average  delays  calculated  over  100  simulations  using  d  =  1ms 
and  d  =  0.1ms.  The  two  plots  demonstrate  that  the  error  drops  significantly  up  to  2000  probes 
after  which  it  becomes  almost  constant.  In  this  example,  increasing  the  resolution  by  a  factor  of 
ten  improves,  although  not  significantly,  the  overall  accuracy  of  the  estimates  especially  for  those 
links  that  enjoy  smaller  delays.  After  10000  probes  the  relative  error  ranges  from  1%  to  23%.  The 
higher  values  occur  when  link  average  delays  are  small  due  to  the  fact  that  for  these  links  the  same 
absolute  error  results  in  a  more  pronounced  relative  error. 
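
The effect described in the last sentence can be seen in a small sketch. The normalization used
here, each error divided by the corresponding sample average before taking the root mean square, is
an assumption; the text does not spell out the exact definition.

```python
import math

def normalized_rms_error(estimates, samples):
    """Root mean square of (estimate - sample) / sample over repeated simulations."""
    terms = [((e - s) / s) ** 2 for e, s in zip(estimates, samples)]
    return math.sqrt(sum(terms) / len(terms))

# Hypothetical average delays (ms) from four simulations of one link.
err = normalized_rms_error([1.1, 0.9, 1.2, 0.8], [1.0, 1.0, 1.0, 1.0])

# The same absolute error of 0.1 ms on a slow link and on a fast link.
err_large_delay = normalized_rms_error([2.1], [2.0])
err_small_delay = normalized_rms_error([0.3], [0.2])
```

With these numbers the link with the 0.2ms average delay shows a relative error ten times larger
than the link with the 2ms average delay, although the absolute error is identical.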

The persistence of systematic errors we observe in Figure 13 is due to the presence of spatial
correlation. In our simulations, a multicast probe is more likely to experience a similar level of
congestion on consecutive links or on sibling links than is dictated by the independence assumption.
We also verified the presence of temporal correlation among successive probes on the same link,
caused by consecutive probes experiencing the same congestion level at a node.

To assess the extent to which our real traffic simulations violate the model assumptions, we
computed the delay correlation between link pairs and among packets on the same link. The
analysis revealed the presence of significant spatial correlations, up to 0.3 to 0.4, between
consecutive links. The smallest values are observed for link 5, which always exhibits a correlation
with its parent node that lies below 0.1. From Figure 13 we verify that, not surprisingly, link 5
enjoys the smallest relative error. We believe that these high correlations are a result of the
small scale of the simulated network. We have observed smaller correlations in large simulations, as
would be expected in real networks because of the wide traffic and link diversity.

The autocorrelation function rapidly decreases and can be considered negligible for a lag larger
than 30 (approximately 2 seconds). The presence of short-term correlation does not alter the key
property of convergence of the estimator, as it suffices that the underlying processes be stationary
and ergodic (this happens, for example, when recurrence conditions are satisfied). The price of
correlation, however, is that the convergence rate is slower than when delays are independent.
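
For reference, a sample autocorrelation of the per-probe delays on one link can be computed as in
the sketch below. The estimator shown (covariance at the given lag divided by the lag-0 sum of
squares) is a common textbook choice and an assumption here, as are the toy data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag (in probes)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Hypothetical delay sequence that alternates between two congestion levels.
delays = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
rho_1 = autocorrelation(delays, 1)
rho_2 = autocorrelation(delays, 2)
```

In a trace where the correlation is short-lived, the values returned for increasing lags shrink
towards zero, which is the behavior reported above for lags beyond 30.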

Now we turn our attention to the inferred distributions. For an experiment of 300 seconds
during which approximately 18000 probes were generated, we plot the complementary c.d.f.
conditioned on the delay being finite in Figure 14. In Figure 15 we also plot the complementary
c.d.f. of the node cumulative delay (we show only the internal nodes, since for the receivers
$k \in R$ the cumulative delay is measured directly). Here $d = 1$ms.

From these two sets of plots, it is striking to note the difference in accuracy between the
estimated cumulative delay distributions $\hat A_k$ and the estimated link delay distributions
$\hat\alpha_k$: while the former are all very close to the actual distributions, the latter are
inaccurate in many cases. This is explained by observing that in the presence of significant
correlations, the convolution among $A_k$, $\alpha_k$ and $A_{f(k)}$ used in the model does not
capture well the relationship between the actual distributions. We verified this by convolving
$\alpha_k$ and $A_{f(k)}$ and comparing the result with $A_k$; as expected, in the presence of
strong local correlation, the results exhibit significant differences that account for the
discrepancies of the inferred distributions. Nevertheless, results should be affected in a
continuous way, with small violations leading to small inaccuracies. Indeed, we have good agreement
for the inferred distributions of links 4, 5, 10 and 12, which are the nodes with the smallest
spatial correlations. Unfortunately it is not easy to determine whether the correlations are strong
and therefore to assess the expected accuracy of the estimates, even though pathological shapes of
the inferred distributions could provide evidence of strong local correlations.¹ A solution could
be the extension of the model to explicitly account for the presence of spatial correlation in the
analysis. This will be the focus of future research.
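
The convolution check described above can be sketched for binned distributions as follows. Under
independence $A_k$ is the convolution of $\alpha_k$ with $A_{f(k)}$, and the link distribution is
recovered by the bin-by-bin deconvolution used in Figure 5. The function names and the numbers are
assumptions for illustration.

```python
def convolve(a, b):
    """Truncated discrete convolution: c[i] = sum_{j <= i} a[j] * b[i - j]."""
    return [sum(a[j] * b[i - j] for j in range(i + 1)) for i in range(len(a))]

def deconvolve(total, parent):
    """Solve total = parent (*) link for link, one bin at a time."""
    link = []
    for i in range(len(total)):
        acc = sum(parent[j] * link[i - j] for j in range(1, i + 1))
        link.append((total[i] - acc) / parent[0])
    return link

parent = [0.5, 0.25, 0.25]   # hypothetical A_{f(k)}: delay distribution up to the parent
link = [0.75, 0.25, 0.0]     # hypothetical alpha_k: delay distribution of link k
total = convolve(parent, link)          # A_k implied by the independence model
recovered = deconvolve(total, parent)   # gives back alpha_k when the model holds
```

When the measured $A_k$ is not the convolution of the other two, as happens under spatial
correlation, the same deconvolution still returns numbers, but they need not form a valid
distribution and some entries can even be negative.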

¹To this end, we observed that under strong spatial correlation, inaccuracies of the estimator
$\hat\alpha$ are often associated with significant increasing portions in the complementary c.d.f.,
which reveal the presence of negative inferred probabilities with possibly non-negligible absolute
values.

Figure 14: Sample vs. estimated delay c.d.f. for selected links.

Figure 15: Sample vs. estimated node $k$ cumulative delay c.d.f.

Figure 16: TCP/UDP Simulation: Topology of Figure 8. Accuracy of the estimated distribution. Largest
vertical absolute deviation between estimated and theoretical c.d.f. Minimum, median and maximum
largest absolute deviation in 100 simulations, computed over all links as a function of $n$ (a) and
link by link for $n = 10000$ (b). [Plots: logarithmic vertical axis from 1e-03 to 1e+00; horizontal
axes "n. of probes" (a) and "Link" (b); curves labeled Maximum, Median, Minimum.]

The accuracy of the inferred cumulative delay distributions, on the other hand, derives from
the fact that even in the presence of significant local correlations, equation (8), which assumes
independence, is still accurate. This can be explained by observing that (8) is equivalent to (4),
which consists of a convolution between $A_{f(k)}$ and $\beta_k$; we expect the correlation between
the delay accrued by a probe in reaching node $f(k)$ and the minimum delay accrued from node $f(k)$
to reach any receiver to be rather small, especially as the tree size grows, as these delays span
the entire multicast tree.

Finally, in Figure 16 we plot the minimum, median and maximum largest deviation between the
inferred and theoretical c.d.f.s over 100 simulations, computed over all links as a function of $n$
(left) and link by link for $n = 10000$ (right). Due to spatial correlation, the largest deviation
levels off after the first 2000 probes, with the median stabilizing at 5%. The accuracy again
exhibits a negative trend as we descend the tree.


6  Conclusions  and  Future  Work 

In this paper, we introduced the use of end-to-end multicast measurements to infer network-internal
delay in a logical multicast tree. Under the assumption of delay independence, we derived an
algorithm to estimate the per-link discrete delay distributions and utilization from the measured
end-to-end delay distributions. We investigated the statistical properties of the estimator, and
showed it to be strongly consistent and asymptotically normal.

We evaluated our estimator through simulation. Within model simulation we verified the
accuracy and convergence of the inferred to the actual values as predicted by our analysis. In real
traffic simulations, we found rapid convergence, although with some persistent differences from the
actual distributions because of spatial correlation.

We are extending our delay distribution analysis in several directions. First, we plan to do more
extensive simulations, exploring larger topologies, different node behavior, background traffic and
probe characteristics. Moreover, we are exploring how representative probe delay is of the delay
suffered by other applications and protocols, for example TCP.

Second, we are analyzing the effect of spatial correlation among delays and we are planning
to extend the model by directly taking into account the presence of correlation. Moreover, we
are studying the effect of the choice of the bin size on the accuracy of the results. To deal with
continuously distributed delay, we derived a continuous version of the inference algorithm, which
we are currently investigating.

Finally, we believe that our inference technique can shed light on the behavior and dynamics
of per-link delay and so allow us to develop accurate link delay models. This will also be the
object of future work.

We feel that multicast-based delay inference is an effective approach to performing delay
measurements. The techniques developed are based on rigorous statistical analysis and, as our
results show, yield representative delay estimates for all traffic that receives the same per-node
behavior as multicast probes. The approach does not depend on cooperation from network elements
and, because of the bandwidth efficiency of multicast traffic, is well suited to cope with the
growing size of today's networks.

A  Uniqueness and Continuous Differentiability of the Inverse


The algorithm presented in Section 3.2.2 computes a solution of the system of equations (5) in the
unknown $A = (A_k(i))_{k \in V,\, i \in \{0, \dots, i_{\max}\}}$ given
$\gamma = (\gamma_k(i))_{k \in V,\, i \in \{0, \dots, i_{\max}\}}$. By deconvolution we then
compute $\alpha = (\alpha_k(i))_{k \in V,\, i \in \{0, \dots, i_{\max}\}}$.

We now show that the solution so computed is the unique solution of the equation
$\gamma = \Gamma(\alpha)$, i.e. that the inverse $\alpha = \Gamma^{-1}(\gamma)$ is uniquely defined.
To this end we rewrite the mapping $\gamma = \Gamma(\alpha)$ as $\gamma = \chi \circ \psi(\alpha)$,
where $A = \psi(\alpha)$ is clearly a bijection. It remains to show that $\chi$ is also a bijection.
To prove this, consider $\alpha$, $A$ and $\gamma$ such that $A = \psi(\alpha)$ and
$\gamma = \chi(A)$. We first show that $A_k(i)$, $i = 1, \dots, i_{\max}$, is the second largest
solution of (8).

In  the  binary  case  we  can  directly  solve  (8)  to  obtain  the  two  solutions 

and 

Afc(^) 

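For the general case we have the following lemma.

Lemma 1  Let $x_1 > x_2 > \dots > x_m$, $m \le \#d(k)$, denote the real solutions of the equation

$$\gamma_k(i) + A_k(0) \Bigl\{ \prod_{d \in d(k)} \Bigl[ 1 - \frac{\gamma_d(i) - \sum_{j=1}^{i-1} \beta_d(i-j) A_k(j) - \beta_d(0)\, x}{A_k(0)} \Bigr] - 1 \Bigr\} + \sum_{j=1}^{i-1} A_k(j) \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(i-j)] - 1 \Bigr\} + x \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(0)] - 1 \Bigr\} = 0. \qquad (16)$$

Then $x_2 = A_k(i)$.

Proof: Substitute $x = A_k(i) + y A_k(0)$ in equation (16), obtaining

$$\gamma_k(i) + A_k(0) \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(i) + \beta_d(0) y] - 1 \Bigr\} + \sum_{j=1}^{i-1} A_k(j) \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(i-j)] - 1 \Bigr\} + (A_k(i) + y A_k(0)) \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(0)] - 1 \Bigr\} = 0. \qquad (17)$$

To prove the lemma we simply need to show that $y = 0$ is the second largest solution of (17).
Expanding the product in the second term we get

$$\gamma_k(i) + A_k(0) \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(i)] - 1 \Bigr\} + A_k(0) \sum_{b \in B} \prod_{m \in \{1, \dots, \#d(k)\}} (1 - \beta_{d_m}(i))^{b_m} (\beta_{d_m}(0)\, y)^{1 - b_m} + \sum_{j=1}^{i-1} A_k(j) \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(i-j)] - 1 \Bigr\} + (A_k(i) + y A_k(0)) \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(0)] - 1 \Bigr\} = 0,$$

where $d_1, \dots, d_{\#d(k)}$ enumerate the children of $k$, $b = (b_1, \dots, b_{\#d(k)})$ and
$B = \{0, 1\}^{\#d(k)} \setminus \{(1, \dots, 1)\}$. Observing that the constant terms sum
to 0 (equation (5)) and dividing by $A_k(0)$, this reduces to

$$\sum_{b \in B} \prod_{m \in \{1, \dots, \#d(k)\}} (1 - \beta_{d_m}(i))^{b_m} (\beta_{d_m}(0)\, y)^{1 - b_m} + y \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(0)] - 1 \Bigr\} = 0.$$

Grouping with respect to $y^l$, we obtain

$$h_k(y) = \sum_{l=1}^{\#d(k)} y^l \sum_{b \in B:\ \sum_m b_m = \#d(k) - l}\ \prod_{m \in \{1, \dots, \#d(k)\}} (1 - \beta_{d_m}(i))^{b_m} \beta_{d_m}(0)^{1 - b_m} + y \Bigl\{ \prod_{d \in d(k)} [1 - \beta_d(0)] - 1 \Bigr\} = 0.$$

The coefficients of the polynomial are all positive but the last, which is negative. The proof
follows observing that since $h_k(0) = 0$, $h_k'(0) < 0$ and $h_k''(y) > 0$ for $y > 0$, there is
one and only one solution of (17) greater than zero. ■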


From the uniqueness of the solution of (6) for canonical delay trees and by induction on $i$, it
then follows that $\chi$ is a bijection; thus, the inverse is uniquely defined.

To prove that the inverse is continuously differentiable we proceed as follows. Denote by
$\theta_{k,i}(\gamma, A)$, $k \in U$, $i = 0, \dots, i_{\max}$, the left hand side of (8). Define
the function $H(\gamma, A) = (\theta_{k,i}(\gamma, A))_{k \in U,\, i = 0, \dots, i_{\max}}$;
$H(\gamma, A) = 0$ is the system of equations to be solved to compute $A$ given $\gamma$. Denote by
$A(\gamma)$ the unique solution to $H(\gamma, A(\gamma)) = 0$. The proof that the inverse is
continuously differentiable amounts to showing that so is $A = \chi^{-1}(\gamma) = A(\gamma)$ (as
$\psi$ and its inverse clearly are). For canonical trees, $A_k(0) > 0$, $k \in U$, and therefore $H$
is continuously differentiable. Then, by the Implicit Function Theorem, so is $A(\gamma)$ provided
that the determinant of the Jacobian $\partial H / \partial A \,|_{A = A(\gamma)}$ is different from
zero. To see this, observe that $\partial \theta_{k_1, i_1} / \partial A_{k_2}(i_2) = 0$ if
$k_1 \ne k_2$ or $i_1 < i_2$; hence the Jacobian matrix is always triangular. The diagonal elements
are

$$\frac{\partial \theta_{k,i}(\gamma, A)}{\partial A_k(i)} \Big|_{A = A(\gamma)} = \frac{h_k'(0)}{A_k(0)} < 0,$$

so the determinant, being the product of the diagonal elements, is nonzero.



B  The  Continuous  Model:  Delay  Distribution  Analysis 

In this Appendix we formulate the delay analysis for continuous delay distributions, rather than for the discrete distributions. We assume now that $D_k \in \mathbb{R}_+ \cup \{\infty\}$ and the distribution to be absolutely continuous w.r.t. Lebesgue measure on $\mathbb{R}_+$ with density $\alpha_k$, together with an atom at $\infty$ of mass $\alpha_k(\infty) = 1 - \int_0^\infty \alpha_k(x)\,dx$, the probability that a packet is lost traversing the link terminating at $k$. (For simplicity of notation, here we do not consider the atom in 0 representing the probability that a probe experiences minimum delay.) $A_k$ is defined similarly for the source to root delays $Y_k$. Similar to the discrete case we define $\Omega_k(x)$ as the event $\{\min_{j \in R(k)} Y_j \le x\}$ and $\gamma_k(x) = P[\Omega_k(x)]$, $k \in U$. Finally, let $\beta_k(x) = P[\min_{j \in R(k)} Y_j - Y_{f(k)} \le x]$, $k \in U$. From the above definitions, the following relations hold:

$$A_k(x) = \int_0^x \alpha_k(y)\, A_{f(k)}(x - y)\,dy, \quad k \in U,$$

where we set $A_0(x) = \delta(x)$,

$$\beta_k(x) = \int_0^x \alpha_k(y) \Big(1 - \prod_{d \in d(k)} [1 - \beta_d(x - y)]\Big)\,dy, \quad k \in U \tag{18}$$

and

$$\gamma_k(x) = \int_0^x A_k(y) \Big(1 - \prod_{d \in d(k)} [1 - \beta_d(x - y)]\Big)\,dy, \quad k \in U, \tag{19}$$

which is the continuous version of (5). Empty products are assumed to be equal to zero.

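The above equation can be rewritten in more convenient form using the Laplace transform

$$\gamma_k(s) = A_k(s) \left[ \frac{1}{s} - \mathop{*}_{d \in d(k)} \left( \frac{1}{s} - \frac{\gamma_d(s)}{A_k(s)} \right) \right], \quad k \in U \tag{20}$$

or

$$\frac{1}{s} - \frac{\gamma_k(s)}{A_k(s)} = \mathop{*}_{d \in d(k)} \left( \frac{1}{s} - \frac{\gamma_d(s)}{A_k(s)} \right), \quad k \in U \tag{21}$$

where $f(s) = \int_0^\infty f(x) e^{-sx}\,dx$ is the Laplace transform of $f(x)$, $*$ is the convolution operator in the domain $s$, $f(s) * g(s) = \int f(p)\, g(s - p)\,dp$, and $\beta_d(s) = \gamma_d(s)/A_k(s)$.

Given $\gamma_k(s)$, $k \in U$, (21) represents a system of independent equations in the unknown $A_k(s)$, $k \in U$. $\alpha_k(s)$ can then be computed by the quotients

$$\alpha_k(s) = \frac{A_k(s)}{A_{f(k)}(s)}, \quad k \in U.$$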
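As a brief check of why the quotient recovers the link transform, the following sketch uses only the convolution theorem for the Laplace transform together with the convention $A_0(x) = \delta(x)$ stated earlier; it is meant as an illustration of the step, not as part of the original derivation.

```latex
A_k(x) = \int_0^x \alpha_k(y)\, A_{f(k)}(x - y)\, dy
\;\Longrightarrow\;
A_k(s) = \alpha_k(s)\, A_{f(k)}(s)
\;\Longrightarrow\;
\alpha_k(s) = \frac{A_k(s)}{A_{f(k)}(s)},
\qquad A_0(s) = 1 .
```

In particular, for a node whose parent is the root the quotient reduces to $\alpha_k(s) = A_k(s)$.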


Solving equation (21) is not trivial, especially because of the convolution on the right hand side, which in general can be computed only numerically. Furthermore, we have not yet been able to establish whether the solution is unique. The inversion of the Laplace transform poses other challenges. It is well known (see for example [1]) that the inverse Laplace transform is an unbounded operator. In other words, arbitrarily small changes in the transform can produce arbitrarily large changes in the original function. Therefore it may not be easy to control the accuracy of the results obtained with such an approach. All these issues will be the subject of further investigation. The estimator $\hat\alpha(s)$ can be computed in the same manner from the estimates $\hat\gamma(s)$ we obtain from the measurements. (21) can be written as a fixed point equation for the $A_k(s)$; this suggests the possible use of the contraction mapping theorem in order to establish existence and uniqueness of solutions.

C Proof of Theorem 4


The  proof  proceeds  by  a  number  of  subsidiary  results. 

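C.1 Limit Behavior of $A$, $\beta$ and $\gamma$

As $\|\alpha\| \to 0$,

(i)
$$A_k(i) = \begin{cases} 1 - s_k(0) + O(\|\alpha\|^2) & i = 0 \\ \sum_{l=0}^{\ell(k)} \alpha_{f^l(k)}(i) + O(\|\alpha\|^2) & i > 0 \end{cases} \tag{22}$$

(ii)
$$1 - \beta_k(i) = \sum_{j > i} \alpha_k(j) + O(\|\alpha\|^2) \tag{23}$$

(iii)
$$\gamma_k(i) = 1 - s_k(i) + O(\|\alpha\|^2) \tag{24}$$

where
$$s_k(i) = \sum_{l=0}^{\ell(k)} \sum_{j > i} \alpha_{f^l(k)}(j). \tag{25}$$

The relation (i) is clear for $i = 0$ by expanding $A_k(0) = \prod_{l=0}^{\ell(k)} \alpha_{f^l(k)}(0)$; for $i > 0$, it follows by an inductive argument on $k$ and $i$: it is true for $k = 1$ and $i = 1$; if it is true for $j \succ k$ and for $0 < i' \le i - 1$, then

$$A_k(i) = \sum_{j=0}^{i} A_{f(k)}(j)\, \alpha_k(i - j) = A_{f(k)}(0)\, \alpha_k(i) + A_{f(k)}(i)\, \alpha_k(0) + O(\|\alpha\|^2) = \alpha_k(i) + \sum_{l=0}^{\ell(f(k))} \alpha_{f^l(f(k))}(i) + O(\|\alpha\|^2).$$

Also (ii) follows by an inductive argument. Observe from (3) that if (ii) holds for all $j$ in $d(k)$ and $i' \le i$ then $1 - \beta_k(i) = 1 - \sum_{j=0}^{i} \alpha_k(j) + O(\|\alpha\|^2)$. Since $\beta_k(i) = \sum_{j=0}^{i} \alpha_k(j)$ for $k \in R$, $i = 1, \ldots, i_{\max}$, (ii) holds for all $k$ and $i$. (iii) then follows expanding $\gamma_k(i) = \sum_{j=0}^{i} A_k(j) [1 - \prod_{d \in d(k)} (1 - \beta_d(i - j))]$. (Observe that the terms within square brackets are always of the form $1 - O(\|\alpha\|)$.)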

C.2  Limit  Behavior  of  the  Covariance  Matrix  u 

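As $\|\alpha\| \to 0$,

$$\mathrm{Cov}(\hat\gamma_{k_1}(i_1), \hat\gamma_{k_2}(i_2)) = s_{k_1 \wedge k_2}(\max(i_1, i_2)) + O(\|\alpha\|^2) \tag{26}$$

where $k_1 \wedge k_2$ denotes the minimal common ancestor of $k_1$ and $k_2$ with respect to $\prec$.

To see this, we write $\mathrm{Cov}(\hat\gamma_{k_1}(i_1), \hat\gamma_{k_2}(i_2)) = E[\hat\gamma_{k_1}(i_1) \hat\gamma_{k_2}(i_2)] - E[\hat\gamma_{k_1}(i_1)]\, E[\hat\gamma_{k_2}(i_2)]$. By definition, $E[\hat\gamma_{k_1}(i_1)] = \gamma_{k_1}(i_1)$ and $E[\hat\gamma_{k_1}(i_1) \hat\gamma_{k_2}(i_2)] = P[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cap\, \min_{m \in R(k_2)} Y_m \le i_2 q]$.

We have three cases:

(i) $k_1 \succeq k_2$, $i_1 \ge i_2$.

In this case $\{\min_{m \in R(k_2)} Y_m \le i_2 q\} \subset \{\min_{m \in R(k_1)} Y_m \le i_1 q\}$ and $E[\hat\gamma_{k_1}(i_1) \hat\gamma_{k_2}(i_2)] = \gamma_{k_2}(i_2)$. Thus,

$$\mathrm{Cov}(\hat\gamma_{k_1}(i_1), \hat\gamma_{k_2}(i_2)) = (1 - \gamma_{k_1}(i_1))\, \gamma_{k_2}(i_2) = s_{k_1}(i_1) + O(\|\alpha\|^2).$$

(26) follows as $i_1 = \max(i_1, i_2)$ and $k_1 = k_1 \wedge k_2$.

(ii) $k_1 \succeq k_2$, $i_1 < i_2$.

Write $P[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cap\, \min_{m \in R(k_2)} Y_m \le i_2 q] = P[\min_{m \in R(k_1)} Y_m \le i_1 q] + P[\min_{m \in R(k_2)} Y_m \le i_2 q] - P[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cup\, \min_{m \in R(k_2)} Y_m \le i_2 q]$. The first two terms are $\gamma_{k_1}(i_1)$ and $\gamma_{k_2}(i_2)$. Then, as for $\|\alpha\| \to 0$

$$E[\hat\gamma_{k_1}(i_1)]\, E[\hat\gamma_{k_2}(i_2)] = 1 - s_{k_1}(i_1) - s_{k_2}(i_2) + O(\|\alpha\|^2), \tag{27}$$

it readily follows that as $\|\alpha\| \to 0$

$$\mathrm{Cov}(\hat\gamma_{k_1}(i_1), \hat\gamma_{k_2}(i_2)) = 1 - P\Big[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cup\, \min_{m \in R(k_2)} Y_m \le i_2 q\Big] + O(\|\alpha\|^2). \tag{28}$$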


To compute $P[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cup\, \min_{m \in R(k_2)} Y_m \le i_2 q]$ we need to define some additional quantities. Denote by $W = \{w_1, \ldots, w_l\} \subset V$ a set of nodes that induces a partition on $R$, i.e., $W$ is such that $\bigcup_{h=1}^{l} R(w_h) = R$ and $R(w_h) \cap R(w_{h'}) = \emptyset$, $1 \le h, h' \le l$, $h \ne h'$. Associate to $W$ a set of delay values $\Delta = \{j_1, \ldots, j_l\}$. The quantities we introduce below are a generalization of the $\beta$ and $\gamma$, where we use for different sets of receivers, namely $R(w_1), \ldots, R(w_l)$, different delays, $j_1, \ldots, j_l$. For $k \in V$, define

$$\chi_{k,W}(\Delta) = P\Big[\bigcup_{h=1}^{l} \Big\{\min_{m \in R(w_h) \cap R(k)} Y_m - Y_{f(k)} \le j_h q\Big\}\Big]. \tag{29}$$

Then, $\chi_{k,W}$ obeys the recursion

$$\chi_{k,W}(\Delta) = \sum_{j=0}^{j_{k,W}} \alpha_k(j) \Big[1 - \prod_{d \in d(k)} \big(1 - \chi_{d,W}(\Delta - j)\big)\Big], \quad k \in U \setminus R \tag{30}$$

$$\chi_{k,W}(\Delta) = \beta_k(j_{k,W}), \quad k \in R \tag{31}$$

where $\Delta - j = \{j_1 - j, \ldots, j_l - j\}$ and $j_{k,W} = \max\{j_h : R(w_h) \cap R(k) \ne \emptyset\}$. Probabilities with negative index are assumed to be equal to zero.

(Footnote: Since probes are assumed independent, it suffices to evaluate all random quantities for $n = 1$ probes.)

For a given node $k$, define now

$$\eta_{k,W}(\Delta) = P\Big[\bigcup_{h=1}^{l} \Big\{\min_{m \in R(w_h) \cap R(k)} Y_m \le j_h q\Big\}\Big]. \tag{32}$$

The following, which can be regarded as a generalization of (5), holds

$$\eta_{k,W}(\Delta) = \sum_{j=0}^{j_{k,W}} A_k(j) \Big[1 - \prod_{d \in d(k)} \big(1 - \chi_{d,W}(\Delta - j)\big)\Big], \quad k \in U \setminus R \tag{33}$$

$$\eta_{k,W}(\Delta) = \sum_{j=0}^{j_{k,W}} A_k(j), \quad k \in R. \tag{34}$$

For $\|\alpha\| \to 0$, it is easy to verify that $\eta_{k,W}(\Delta)$ behaves as $\gamma_k(j_{k,W})$ (the terms within square brackets are always of the form $1 - O(\|\alpha\|)$). In other words, $\eta_{k,W}(\Delta) = 1 - s_k(j_{k,W}) + O(\|\alpha\|^2)$. With the definitions above, we can now write


$$P\Big[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cup\, \min_{m \in R(k_2)} Y_m \le i_2 q\Big] = \eta_{k_1,W}(\Delta) \tag{35}$$

where $W = \{w_1, \ldots, w_l\}$, $w_1 = k_2$, and $\Delta = \{j_1, \ldots, j_l\}$, with $j_1 = i_2$ and $j_{l'} = i_1$, $2 \le l' \le l$. Then, as $\|\alpha\| \to 0$,

$$P\Big[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cup\, \min_{m \in R(k_2)} Y_m \le i_2 q\Big] = 1 - s_{k_1}(i_2) + O(\|\alpha\|^2). \tag{36}$$

Thus,

$$\mathrm{Cov}(\hat\gamma_{k_1}(i_1), \hat\gamma_{k_2}(i_2)) = s_{k_1}(i_2) + O(\|\alpha\|^2). \tag{37}$$

(26) follows as $k_1 = k_1 \wedge k_2$ and $i_2 = \max(i_1, i_2)$.

(iii) $k_1 \wedge k_2 \succ k_1, k_2$.

We proceed as for (ii). In this case, we can write

$$P\Big[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cup\, \min_{m \in R(k_2)} Y_m \le i_2 q\Big] = \eta_{k_1 \wedge k_2, W}(\Delta) \tag{38}$$

where $W = \{w_1, \ldots, w_l\}$, $w_1 = k_1$ and $w_2 = k_2$, and $\Delta = \{j_1, \ldots, j_l\}$, with $j_1 = i_1$, $j_2 = i_2$ and $j_{l'} = -1$, $3 \le l' \le l$ (for (38) to hold, we need to set $j_{l'}$ to $-1$, $l' \ne 1, 2$, or any other negative number, to ensure that all events regarding receivers different from $R(k_1)$ and $R(k_2)$ have probability zero). Thus, as $\|\alpha\| \to 0$,

$$P\Big[\min_{m \in R(k_1)} Y_m \le i_1 q \,\cup\, \min_{m \in R(k_2)} Y_m \le i_2 q\Big] = 1 - s_{k_1 \wedge k_2}(\max(i_1, i_2)) + O(\|\alpha\|^2). \tag{39}$$

Thus,

$$\mathrm{Cov}(\hat\gamma_{k_1}(i_1), \hat\gamma_{k_2}(i_2)) = s_{k_1 \wedge k_2}(\max(i_1, i_2)) + O(\|\alpha\|^2). \tag{40}$$
C.3 Limit Behavior of the Jacobian $D(\alpha)$

As $\|\alpha\| \to 0$,

f  1  k2  =  ki 

T»(a)  =  5  (g)  +  0(11  a||)  where  =  <  -1  k2  =  f{ki)  ,  (41) 

I  0  otherwise 

where $B$ is an $(i_{\max} + 2) \times (i_{\max} + 2)$ matrix with entries $B_{ii'} = \delta_{ii'} - \delta_{i,i'+1}$, $\otimes$ denotes the Kronecker product and $\delta_{ii'} = 1$ if $i = i'$ and 0 otherwise.

To  establish  this,  we  first  show  that  its  inverse  D~^  (a)  whose  elements  are  {D~^  (®))(fci,n)(fc2,*2)  = 
d"/ki  {ii) I dak^{i2)  has  the  following  form  for  ||  a  ||  — >  0, 

T)“^(a)  =  T  (g)  Z)  +  0(||a||)  where  =  I  ^  ^  ^^2) 

[  0  otherwise 

where  Z  is  a  unit  lower  triangular  matrix,  i.e.,  Lai  =  l{i<x}-  To  this  end  we  rewrite  7fc(i)  as 

—  fid{i  —  j))]-  We  have  the  following  three  cases: 


(i)  ki  A  Z2,  ii  >  i2- 

Let  I'  be  such  that  f  (ki)  =  Z2.  Then,  for  ||  a  ||  — >  0 


^7fci(L) 

dak2{i2) 


na-A.(n-2))i 

j=*2  ded{ki) 


E 


d 


( 


i{ki) 


\ 


s  n"z(L)U)|[i-  Yi  {1  -  mh  -  j))] 

dEd{ki ) 


i(ki)  .  .  1=0 

1=0  dl=J 


) 


I 

2=«2 
£{ki) 

«/7fci)(0)[l  “  (1  — /3(i(*i  —  E))]  + 

1=0, d£d(ki) 
n  i{ki) 

s  s  n  "g(L)(j0[i-  n  -  dd{ii  -  j))] 

j=,,+i  1=0, i^i'  ded(k,) 

i+o(ii«ii) 


(43) 

(44) 


(45) 

(46)
as the first term of (45) goes to 1 while the second goes to 0 because in any product there is at least one $l$ such that $j_l > 0$.


(ii)  ki  y  k2,  ii  >  ^2* 

Let  d'  be  such  that  there  exists  an  I  such  that  f{k2)  =  d' .  Then, 


dak^{i2) 


4  n(iGd(fci)(^  /5d(*l  i))] 

2^  Ak^(J) - 


2=0 
n  -«2 


«fc2(*2) 


2=0  ^2  I  2j  ded(ki),dj^d’ 

Oi\\a\\) 


as  each  product  goes  to  0,  as  ||  a  ||  — >  0. 


(47) 

(48) 

(49) 


(iii)  ii  <  i2. 

In  this  last  case,  7^^  (f  1)  does  not  depend  on  ak^  (^2)  and  the  derivative  is  0. 

Since matrix inversion is continuous in an open neighborhood of non-singular matrices, (41) follows since $D$ and $D^{-1}$ are inverses (see Section 10 of [4]), as also are $L$ and $B$ (trivial), and since for invertible square matrices $F$ and $G$, $(F \otimes G)^{-1} = F^{-1} \otimes G^{-1}$.
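The two matrix facts used in this step can be checked numerically on a small case. The sketch below is only an illustration: it assumes $L$ is the lower triangular matrix of ones and $B$ the difference matrix with 1 on the diagonal and $-1$ on the subdiagonal, and the $2 \times 2$ matrix `G` with its inverse is an arbitrary example chosen here, not a quantity from the proof.

```python
def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def kron(a, b):
    """Kronecker product: block (i, j) of the result is a[i][j] * b."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def lower_ones(n):
    """L with L[r][c] = 1 when c <= r, else 0."""
    return [[1 if c <= r else 0 for c in range(n)] for r in range(n)]


def difference_matrix(n):
    """B with B[r][c] = delta(r, c) - delta(r, c + 1)."""
    return [[1 if r == c else -1 if r == c + 1 else 0 for c in range(n)]
            for r in range(n)]


n = 4
L = lower_ones(n)
B = difference_matrix(n)
G = [[2, 1], [1, 1]]          # arbitrary invertible example (determinant 1)
G_inv = [[1, -1], [-1, 2]]    # its exact inverse

lb_product = matmul(L, B)
kron_product = matmul(kron(L, G), kron(B, G_inv))
print(lb_product == identity(n), kron_product == identity(2 * n))
```

All entries are integers, so the comparisons with the identity are exact rather than subject to floating point tolerance.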


C.4  Proof 


From  Theorem  3,  (26),  (41)  and  continuity  of  finite  dimensional  matrix  products,  we  have  for 

II  a  II  — >  0  that 


^(ki,ii)(k2,i2) 


'y  y  *2))-^(fc2,*2)(fcg*2)  )• 

fcj  1^2  d'l  ’*2 


It  remains  to  evaluate 


(50) 


k'k'i' 


^iki,ii)ik[,i[)'^k[,k2  (niax(i]^ ,  *2))-^(fe2 


,j2)(fe77) 


X]  -S*'i*'i[sfeiAfe2(max(i(,  4))  -  (max(i( ,  4))  - 

*i.*2 

SfeiA/(fe2)(max(i(,i'2))  +  s;(fe^)A/(fe7(max(i(,i'2))]5i^(|l) 


When  ki  ^  k2,  then  (51)  yields  0.  Indeed,  if  k2  >~  ki,  ki  A  k2  =  f{ki)  A  k2  =  k2,  while 
ki  A  f{k2)  =  f{ki)  A  f{k2)  =  f{k2),  and  hence  (51)  is  zero.  Similar  for  ki  A-  k2.  In  all  other 




cases,  ki  A  k2  y  ki,k2  and  so  ki  A  k2  =  f{ki)  A  k2  =  ki  A  f{k2)  =  f{ki)  A  f{k2)  =  f{k2)  and 

(51)  is  again  zero. 

When  ki  =  k2,  ki  A  k2  =  ki,  ki  A  f{k2)  =  f{ki)  A  k2  =  f{ki)  A  f{k2)  =  f{ki)  and  (51) 
reduces  to 

X!^n*'i['5fci(max(z'i,Z2))  -  s/(fcj)(max(z'i,  Z2))]5,2*'  (52) 

i[  ,12 

Substituting  Bij  =  Sij  —  Sij^i  in  (52),  it  is  easy  to  verify  the  following: 


X!^n*'i['5fci(max(z'i,Z2))  -  s/(fcj)(max(z'i,  Z2))]5, 


*2*2 


Ej>o«fci(2)  *i=*2  =  0 

afci(*i)  7^  0,  Z2  =  0 

afci(*2)  *2  =  0,  Z2  7^  0 

0  *1  7^  *2?  *17*2  7^  0 

(53)

Multicast Inference of Packet Delay Variance at Interior Network Links

N.G. Duffield, F. Lo Presti

Abstract: End-to-end measurement is a common tool for network performance diagnosis, primarily because it can reflect user experience and typically requires minimal support from intervening network elements. Challenges in this approach are (i) to identify the locale of performance degradation; and (ii) to perform measurements in a scalable manner for large and complex networks. In this paper we show how end-to-end delay measurements of multicast traffic can be used to estimate packet delay variance on each link of a logical multicast tree. The method does not depend on cooperation from intervening network elements; multicast probing is bandwidth efficient. We establish desirable statistical properties of the estimator, namely consistency and asymptotic normality. We evaluate the approach through model based and network simulations. The approach extends to the estimation of higher order moments of the link delay distribution.

Keywords: End-to-end measurement, queueing delay, estimation theory, multicast trees, network tomography

I.  Introduction 
A.  Background  and  Motivation. 

Monitoring the performance of large communications networks and diagnosing the causes of its degradation is a challenging problem. There are two broad approaches to performance diagnosis. In the internal approach, direct measurements are made at or between network elements, e.g. of packet loss or delay. This approach has a number of potential limitations: it may not be available for general users; coverage may not span paths of interest; measurements may be disabled during periods of high load; there are issues of scale in gathering and correlating the measurements in large networks; and it is unclear how to compose per-hop measurements into an end-to-end view.

This motivates external approaches, diagnosing the network through end-to-end measurements, without necessarily assuming the cooperation of network elements on the path. There has been much recent experimental work to understand the phenomenology of end-to-end performance (e.g., see [1], [2], [8], [21], [16], [23], [24], [26]); several measurement infrastructure projects are in development (including CAIDA [6], Felix [10], IPMA [12], NIMI [15], Surveyor [30]) with the aim to collect and analyze end-to-end measurements across a mesh of paths between a number of hosts. Standard diagnostic tools for IP networks, ping and traceroute, report roundtrip loss and delay. A recent refinement of this approach, pathchar [13], estimates hop-by-hop link capacities, packet delay and loss rates. pathchar is still under evaluation; initial experience indicates many packets are required for inference, leading to either high load of measurement traffic or long measurement intervals, although adaptive approaches can reduce this [9]. More broadly, measurement approaches based on Time To Live (TTL) expiry require the cooperation of network elements in returning Internet Control Message Protocol (ICMP) messages. In future, encapsulation may hide TTL from higher layers that would see just a single hop between tunnel endpoints. Finally, the success of active measurement approaches to performance diagnosis may itself cause increased congestion if intensive probing techniques are widely adopted.

In response to some of these concerns, a multicast-based approach to active measurement has been proposed recently in [3], [4]. The idea is that correlation in performance seen on intersecting end-to-end paths can be used to draw inferences about the performance characteristics of their common portion, without cooperation from the network. Multicast traffic is well suited for this since a given packet only occurs once per link in the (logical) multicast tree. Characteristics such as loss and end-to-end delay seen at different endpoints are highly correlated. Another advantage is in scalability. Suppose packets are exchanged on a mesh of paths between a collection of $N$ measurement hosts stationed in a network. With unicast the probe load on the network may grow proportionally to $N^2$ in some links of the network; with multicast the load grows proportionally only to $N$.

Fig. 1. Left: Two leaf tree. Right: m-leaf tree.


of multicast probe packets. It is assumed that the link delays are independent random variables, both spatially (i.e. between different links) and temporally (i.e. between different packets); later we discuss the impact of violation of these assumptions. The method rests on (generalizations of) the following observation. Consider the logical multicast topology of Figure 1 (left), in which packets are multicast from the root 0 to receivers 1 and 2. $D_i$ is the random delay on link $i$, and the source-to-leaf delays from the root 0 to the leaf nodes 1 and 2 are $X_1 = D_k + D_1$ and $X_2 = D_k + D_2$ respectively. Then a simple calculation shows that, under the independence assumption,

$$\mathrm{Var}(D_k) = \mathrm{Cov}(X_1, X_2), \tag{1}$$

i.e. we express the variance of an internal link delay in terms of the covariance of the source-to-leaf delays. We can form an unbiased estimate of $\mathrm{Cov}(X_1, X_2)$ directly from end-to-end measurements; this constitutes an unbiased estimate of $\mathrm{Var}(D_k)$. The same method extends to higher order moments; when the node $k$ has branching ratio $m$, we are able to estimate the first $m$ moments of $D_k$; see Figure 1 (right). We specify the delay model in Section II and describe the basic moment estimators in Section III.
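The observation behind (1) can be tried out on synthetic data. The sketch below is a minimal illustration under assumptions made here, not the authors' implementation: link delays are drawn as independent exponentials with hypothetical means (2.0 on the shared link $k$, 1.0 and 0.5 on the two leaf links), so the true value of $\mathrm{Var}(D_k)$ is 4.0, and the function names are invented for the example.

```python
import random


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate_two_leaf_tree(n_probes, rng):
    """Draw end-to-end delays (X1, X2) for the two-leaf tree.

    Assumed (illustrative) link delay model: independent exponential delays
    with mean 2.0 on the shared link k, 1.0 on link 1 and 0.5 on link 2.
    """
    x1, x2 = [], []
    for _ in range(n_probes):
        d_k = rng.expovariate(1 / 2.0)  # shared link, variance 4.0
        d_1 = rng.expovariate(1 / 1.0)
        d_2 = rng.expovariate(1 / 0.5)
        x1.append(d_k + d_1)            # X1 = Dk + D1
        x2.append(d_k + d_2)            # X2 = Dk + D2
    return x1, x2


rng = random.Random(0)
x1, x2 = simulate_two_leaf_tree(100_000, rng)
var_dk_estimate = sample_covariance(x1, x2)
print(var_dk_estimate)
```

Only the end-to-end delays `x1` and `x2` enter the estimate; the per-link draws are never observed by the estimator, which mirrors the measurement setting described above.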

Here we focus on estimation of the delay variance, either on individual links, or from the root to a given node. In Section IV we show how the above scheme can be used to obtain multiple unbiased estimates of the variance of the delay from the root to a given node $k$, one estimator for every pair of leaf nodes descended through different children of $k$. The estimates are consistent, i.e., they converge in probability to the true variance as the number of probes grows to infinity. Any convex combination of these estimators shares these properties, although the rate of convergence will be different in each case. This rate can be used to distinguish between the estimators. We show how to choose the weights in order to obtain the combination with the fastest asymptotic rate of convergence.

Packet loss reduces the number of packets available for delay estimation, hence increasing estimator variance. In Section V we quantify this for an estimation scheme that makes maximal use of information from surviving packets, using all packets reaching a given node pair for which a covariance estimator is calculated.

The model used here also assumes temporal independence, i.e., that delays between successive probe packets at a given node are not dependent. This can be arranged for by making the interprobe times greater than the queueing timescale. However, for a wide class of temporally dependent delay processes (we require only ergodicity) the consistency of the estimators is unaffected, i.e., they still converge to the true values as the number of probes grows to infinity. However, the rate of convergence may be slower.

In Section VI we report two types of simulation: (i) model simulations with packet delay chosen pseudo-randomly according to a given distribution; and (ii) ns [22] simulations in which the probe traffic was mixed with background traffic of TCP and UDP sessions, delay occurred as a result of queueing against background traffic, and loss was due to buffer overflow. The model simulations allow us to compare the theoretical prediction with a model in a controlled manner. We verify the accuracy of the delay variance estimator. The variance of the variance estimators over many simulation runs is conformant with the model; this verifies the benefit in accuracy of using the minimum variance estimator. The ns simulations allow us to investigate the performance of the inference method in a more realistic setting in which the independence assumption may not be exactly satisfied. We find that dependence between delays in different links is smaller when buffers are larger, and that inference is correspondingly more accurate. In a 12 node topology we find the typical error in estimation is about 23%, based on a sample size of 1,000 probes. We believe this is sufficiently accurate to distinguish links with high delay variance. As far as we are aware there are no studies in deployed networks that measure delay correlation between different nodes. However, we believe that large and long-lasting spatial dependence is unlikely in a real network such as the Internet because of its traffic and link diversity.

C.  Implementation  Requirements 

Since the data for delay inference comprises one-way packet delays, the primary requirement is the deployment of measurement hosts with synchronized clocks. (Actually, since delay covariances are invariant under time-shifts, the absolute times need not be synchronized, provided that the rates are identical.) Using Global Positioning System (GPS) timing it is possible to make one-way delay measurements accurate to within tens of microseconds or better. GPS is currently used or planned in several of the measurement infrastructures mentioned earlier. The Network Time Protocol (NTP) [17] is more widely deployed, but provides accuracy only on the order of a few tens of milliseconds, a resolution at least as coarse as the queueing delays in practice. An alternative approach to calibration and synchronization of clocks has been developed in [25], [27], [18].
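The remark that covariances are invariant under time-shifts can be illustrated with a few lines of code. This is a hedged sketch on synthetic data: the delay distributions and the two constant clock offsets (`offset_1`, `offset_2`) are hypothetical values chosen for the example, and identical clock rates are assumed throughout.

```python
import random


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(1)
shared = [rng.expovariate(1.0) for _ in range(5000)]   # delay on the common path
x1 = [d + rng.expovariate(2.0) for d in shared]        # true one-way delays, receiver 1
x2 = [d + rng.expovariate(4.0) for d in shared]        # true one-way delays, receiver 2

# Hypothetical constant offsets of each receiver clock relative to the source.
offset_1, offset_2 = 0.25, -1.5
y1 = [x + offset_1 for x in x1]                        # what receiver 1 would record
y2 = [x + offset_2 for x in x2]                        # what receiver 2 would record

cov_true_clocks = sample_covariance(x1, x2)
cov_offset_clocks = sample_covariance(y1, y2)
print(cov_true_clocks, cov_offset_clocks)
```

The two printed values agree up to floating point rounding, because a constant offset is removed when each sequence is centered on its own sample mean. A difference in clock rates would not cancel in this way.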

Another requirement is to know the multicast topology. There is a multicast-based measurement tool, mtrace [19], already in use in the Internet. mtrace reports the route from a multicast source to a receiver, along with other information about that path such as per-hop loss and rate. Presently it does not support delay measurements. A potential drawback for larger topologies is that mtrace does not scale to large numbers of receivers, as it needs to run once for each receiver to cover the entire multicast tree. In addition, mtrace relies on multicast routers responding to explicit measurement queries, a feature that can be administratively disabled. An alternative approach that is closely related to the work on multicast-based loss inference [3], [4] is to infer the logical multicast topology directly from measured probe statistics; see [5], [28]. The delay variance estimates of the present paper can also be used to infer topology. This method does not require cooperation from the network.

D.  Use  of  Delay  Variance  Estimate 

Although prior work has characterized end-to-end delays [1], [21], [24], to the best of our knowledge there is no generally accepted model for per link delays in real networks. Without a model it is difficult to map a given inferred value of the link delay variance to a specific value of a quality metric, such as the probability of queueing delay exceeding a given value. Nevertheless, we believe that knowledge of the per link delay variance will be increasingly useful for the following reasons:

Model Development. The mapping problem just described will become easier upon development of delay models. We expect these to arise from two sources. The first is the development of measurement infrastructure projects in which selected links are instrumented for one-way delay measurements. The second is the development of multicast-based estimators for the link delay distribution from end-to-end measurements, using a more computationally intensive technique proposed in a companion paper [14]. We anticipate that this will allow the development of link delay distribution models, with the distribution inferred from network measurements.

Ordering. Identification of links with highest delay variance suggests candidates for links on which performance is degraded for delay sensitive applications.

Delay and Delay Variation. The variance of the packet delay (on a link or path) can be used to estimate or bound the variance of the interpacket delay variation. Let $D^{(i)}$ be the delay encountered by packet $i$ on a given link. The interpacket delay variation (or jitter) between packets $i$ and $i+1$ on the link is $J^{(i)} = D^{(i+1)} - D^{(i)}$; a similar notion applies to end-to-end delay. Observe

$$\mathrm{Var}(J^{(i)}) = \mathrm{Var}(D^{(i)}) + \mathrm{Var}(D^{(i+1)}) - 2\,\mathrm{Cov}(D^{(i)}, D^{(i+1)}). \tag{2}$$

Assuming $(D^{(i)})$ to be stationary, the first two terms on the RHS of (2) are equal, while under the assumption of temporal independence the last term is zero, and so $\mathrm{Var}(J^{(i)}) = 2\,\mathrm{Var}(D^{(i)})$. Measurements of end-to-end delays in the Internet [1] show that the end-to-end delays of successive packets are only slightly dependent when the interpacket time is longer than the typical queueing timescales. Stronger dependence is found at shorter timescales: successive packets are more likely to queue together. With positive correlation between successive probe delays, $\mathrm{Cov}(D^{(i)}, D^{(i+1)}) > 0$; in this case $\mathrm{Var}(J^{(i)})$ is bounded above by $2\,\mathrm{Var}(D^{(i)})$, a quantity that we can estimate from end-to-end measurements.
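The relation between jitter variance and delay variance in (2) can be checked on synthetic sequences. The sketch below is illustrative only and rests on assumptions made here: the independent case uses exponential delays, and the positively correlated case uses a first-order autoregressive sequence with coefficient `rho = 0.5`, which stands in for delay fluctuations around a mean rather than modelling any real queue.

```python
import random


def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def jitter(delays):
    """Interpacket delay variation J(i) = D(i+1) - D(i)."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


rng = random.Random(2)
n = 50_000

# Temporally independent delays: Var(J) should be close to 2 Var(D).
iid_delays = [rng.expovariate(1.0) for _ in range(n)]

# Positively correlated fluctuations (assumed AR(1) model, rho = 0.5):
# here Var(J) = 2 (1 - rho) Var(D), which is below the bound 2 Var(D).
rho = 0.5
correlated = [rng.gauss(0.0, 1.0)]
for _ in range(n - 1):
    correlated.append(rho * correlated[-1] + rng.gauss(0.0, 1.0))

ratio_iid = variance(jitter(iid_delays)) / variance(iid_delays)
ratio_correlated = variance(jitter(correlated)) / variance(correlated)
print(ratio_iid, ratio_correlated)
```

With these settings the first ratio comes out near 2 and the second near 1, consistent with $2\,\mathrm{Var}(D^{(i)})$ acting as an upper bound under positive correlation.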

Topology Inference. If the logical multicast topology is not initially known, it can be inferred from delay variances. This technique uses the estimated variance of the cumulative delay from the source to a given node. Consequently we shall be interested here in the estimation of cumulative delay variance as well as link delay variance.

II.  The  Tree  and  Delay  Models 

We identify the physical multicast tree as comprising actual network elements (the nodes) and the communication links that join them. The logical multicast tree comprises the branch points of the physical tree, and the logical links between them. The logical links comprise one or more physical links. Thus each node in the logical tree, except the leaf nodes and possibly the root, must have 2 or more children. We can construct the logical tree from the physical tree by deleting all nodes, other than the root, that have exactly one child and adjusting the links accordingly by directly joining the parent and child of each deleted node.
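The construction of the logical tree can be sketched in a few lines. This is an illustrative implementation under assumptions made here: the physical tree is given as a hypothetical `parent` dictionary mapping every non-root node to its parent, the root is labelled 0, and the function name `logical_tree` is invented for the example.

```python
def logical_tree(parent, root=0):
    """Collapse a physical multicast tree into its logical tree.

    parent maps each non-root physical node to its parent. The result maps
    each surviving node (leaf or branch point) to its logical parent.
    """
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)

    logical = {}
    for node in parent:
        if len(children.get(node, [])) == 1:
            continue  # single-child relay node: removed from the logical tree
        # Climb past relay nodes until reaching the root or a branch point.
        ancestor = parent[node]
        while ancestor != root and len(children.get(ancestor, [])) == 1:
            ancestor = parent[ancestor]
        logical[node] = ancestor
    return logical


# Physical tree: 0 -> 1; 1 -> 2 and 5; 2 -> 3 and 4; 5 -> 6 (node 5 is a relay).
physical_parent = {1: 0, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5}
logical_parent = logical_tree(physical_parent)
print(logical_parent)
```

In the example, node 5 has a single child and is removed, so leaf 6 is attached directly to the branch point 1; the logical link (1, 6) then stands for the two physical links (1, 5) and (5, 6).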

Let $T = (V, L)$ denote a logical multicast tree with nodes $V$ and links $L$. We identify one node, the root 0, with the source of probes, and $R \subset V$ will denote the set of leaf nodes (identified as the set of receivers). The set of children of node $j \in V$ is denoted by $d(j)$. Each node $k$, apart from the root, has a parent $f(k)$ such that $(f(k), k) \in L$. Define recursively the compositions $f^n = f \circ f^{n-1}$ with $f^1 = f$. Nodes are said to be siblings if they have the same parent. If $k = f^m(j)$ for some $m \in \mathbb{N}$ we say that $j$ is descended from $k$ (or equivalently that $k$ is an ancestor of $j$) and write the corresponding partial order in $V$ as $j \prec k$. $i \vee j$ will denote the minimal common ancestor of $i$ and $j$ in the $\prec$-ordering.
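The parent map, the descendant relation and the minimal common ancestor can be made concrete as follows. This is a minimal sketch: the dictionary `parent` plays the role of $f$, the helper names are invented here, and the descendant test is taken to be strict (at least one application of $f$), which is an assumption about how $m \in \mathbb{N}$ is meant.

```python
def ancestor_chain(node, parent):
    """Return [node, f(node), f(f(node)), ..., root]."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def is_descended_from(j, k, parent):
    """True when k = f^m(j) for some m >= 1 (assumed strict relation)."""
    return k in ancestor_chain(j, parent)[1:]


def minimal_common_ancestor(i, j, parent):
    """First node on the path from j to the root that also lies above or at i."""
    on_path_from_i = set(ancestor_chain(i, parent))
    for candidate in ancestor_chain(j, parent):
        if candidate in on_path_from_i:
            return candidate
    raise ValueError("nodes do not belong to the same tree")


# Example logical tree: 0 -> 1; 1 -> 2 and 3; 2 -> 4 and 5.
parent = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(minimal_common_ancestor(4, 5, parent),
      minimal_common_ancestor(4, 3, parent),
      is_descended_from(4, 1, parent))
```

When one node is an ancestor of the other, the function returns that ancestor, so for example the minimal common ancestor of 4 and 2 is 2.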

We associate with each node $k$ a random variable $D_k$ taking values in the extended positive real line $\overline{\mathbb{R}}_+ = \mathbb{R}_+ \cup \{\infty\}$. By convention $D_0 = 0$. $D_k$ is the random delay that would be encountered by a packet attempting to traverse the link $(f(k), k) \in L$. The value $D_k = \infty$ indicates the packet is lost on the link. The delay experienced on the path from the root 0 to a node $k$ is $X_k = \sum_{j \succeq k} D_j$. We assume that the $D_k$ are independent. Let $P[D_k < \infty]$ denote the probability of successful transmission over link $k$.

III.  Non-parametric  Estimation  of  Delay 
Distribution  Moments 

In  fhis  section  we  presenf  a  class  of  non-paramefric  es¬ 
timators  of  fhe  delay  disfribufion.  We  assume  inifially  fhaf 
all  delays  are  finite:  P[Dk  =  oo]  =  0.  Consider  firsf  a 
logical subtree formed by the root 0, and a non-leaf node $k$ with two descendants 1 and 2 that are leaf nodes; see Figure 1(left). By writing $X_i = X_k + (X_i - X_k)$ in the expression for $\mathrm{Cov}(X_1, X_2)$, expanding using the bilinearity of the covariance operator $\mathrm{Cov}(\cdot,\cdot)$, and using the mutual independence of the link delays $X_k$, $X_1 - X_k$ and $X_2 - X_k$, we obtain

$$\mathrm{Cov}(X_1, X_2) = \mathrm{Var}(X_k). \qquad (3)$$

Hence any unbiased estimator of $\mathrm{Cov}(X_1, X_2)$ is also an unbiased estimator of $\mathrm{Var}(X_k)$. Let $X_1^{(i)}, X_2^{(i)}$, $i = 1, 2, \ldots, n$ be measured end-to-end delays between the root 0 and leaf nodes 1 and 2 respectively. Abbreviate $\mathrm{Cov}(X_j, X_k)$ by $s_{jk}$ and write $s_{kk}$ as $s_k$. We estimate $s_k$ by a uniformly minimum variance unbiased estimator of $s_{12}$, namely $\hat{s}_{12}$ where

$$\hat{s}_{12} = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_1^{(i)} X_2^{(i)} - \frac{1}{n}\sum_{i=1}^{n} X_1^{(i)} \sum_{i=1}^{n} X_2^{(i)}\right). \qquad (4)$$

At a node with branching ratio $m$ we are able to estimate the first $m$ moments of the delay on the shared portion of the path from the root; see Figure 1(right). The cumulant generating function of the $m$ leaf delays $X = (X_1, \ldots, X_m)$ is defined for $\theta \in \mathbb{R}^m$ by

$$\Lambda(\theta; X) = \log E\left[\exp\left(\sum_{i=1}^{m} \theta_i X_i\right)\right]. \qquad (5)$$

The cumulants are defined by partial differentiation w.r.t. the components $\theta_i$ (when derivatives exist): for indices $j_1, \ldots, j_m \in \mathbb{Z}_+$ set

$$K^{j_1,\ldots,j_m}(X) = \left.\frac{\partial^{\,j_1+\cdots+j_m}}{\partial\theta_1^{j_1}\cdots\partial\theta_m^{j_m}}\,\Lambda(\theta; X)\right|_{\theta=0}. \qquad (6)$$

The first and second cumulants $K^1$ and $K^2$ of a single random variable are its mean and variance respectively. Knowing the cumulants of a set of random variables is equivalent to knowing their joint distribution. The cumulants of $X_k$ are related to those of the $X_i$ as follows. Set $\mathbf{1} = (1, \ldots, 1) \in \mathbb{Z}_+^m$.

Theorem 1: $K^{\mathbf{1}}(X) = K^{m}(X_k)$. Hence any unbiased estimator of $K^{\mathbf{1}}(X)$ is also an unbiased estimator of $K^{m}(X_k)$.

Proof: Observe $K^{\mathbf{1}}(X_1, \ldots, X_m) = K^{\mathbf{1}}(X_1 - X_k, \ldots, X_m - X_k) + K^{\mathbf{1}}(\mathbf{1}X_k) = K^{m}(X_k)$. The first equality is because $K$ is affine in each of its arguments, the second because the cumulant of a set of independent random variables is zero. ■


IV. Delay Variance Estimation on General Trees

In a general tree let $Q(k) = \{\{i,j\} \subset R \mid i \vee j = k\}$ be the set of distinct pairs of leaf-nodes whose least common ancestor is $k$. Any convex combination $\sum_{\{i,j\} \in Q(k)} \mu_{ij}\hat{s}_{ij}$ (i.e. with the $\mu_{ij} \ge 0$ and summing to 1) is also an unbiased estimator of $s_k$. An example is the uniform estimator

$$\frac{1}{\#Q(k)} \sum_{\{i,j\} \in Q(k)} \hat{s}_{ij}. \qquad (7)$$
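The uniform estimator can be illustrated with a small simulation. The sketch below is not the authors' code: the tree, the exponential link delays with unit variance, and all function names are assumptions chosen for illustration. It draws independent link delays on a small binary tree, forms the unbiased sample covariance for each leaf pair whose least common ancestor is $k$, and averages them.

```python
import random
from itertools import combinations

# Hypothetical tree: parent[child] = parent node; link k ends at node k.
PARENT = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
LEAVES = [4, 5, 6, 7]

def ancestors(node, parent):
    """Nodes on the path from `node` up to the root, nearest first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(i, j, parent):
    """Least common ancestor of nodes i and j."""
    anc_i = set(ancestors(i, parent))
    return next(a for a in ancestors(j, parent) if a in anc_i)

def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def simulate(parent, leaves, n, seed=0):
    """End-to-end delays per leaf for n probes; link delays are Exp(1) (assumed)."""
    rng = random.Random(seed)
    delays = {leaf: [] for leaf in leaves}
    for _ in range(n):
        link = {k: rng.expovariate(1.0) for k in parent}  # one delay per link
        for leaf in leaves:
            delays[leaf].append(
                sum(link[a] for a in ancestors(leaf, parent) if a in parent))
    return delays

def uniform_estimate(k, leaves, parent, delays):
    """Average of pairwise sample covariances over pairs whose LCA is k."""
    pairs = [(i, j) for i, j in combinations(leaves, 2)
             if lca(i, j, parent) == k]
    return sum(sample_cov(delays[i], delays[j]) for i, j in pairs) / len(pairs)
```

With unit-variance links, the cumulative delay variance to node 1 is 1 and to node 2 is 2, so `uniform_estimate(1, ...)` and `uniform_estimate(2, ...)` should approach those values as the number of probes grows.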


One potential disadvantage with the uniform estimator is that high variance of one of the summands may lead to high estimator variance overall. This motivates choosing convex combinations that are functions of the end-to-end delays themselves in order to reduce variance. In this section we shall assume that all delays are finite with bounded fourth moments. We shall relax the finiteness assumption in Section V.

We formalize the notion of (possibly random) convex combinations of the $\hat{s}_{ij}$ through a covariance aggregator. For $S \subset R$ let $\mathcal{F}_n(S)$ denote the $\sigma$-algebra generated by the end-to-end delays $(X_k)_{k \in S}$ (i.e. the set of events that can be determined from knowing $(X_k)_{k \in S}$). A covariance aggregator $\mu$ is a sequence $(\mu(n))_{n \in \mathbb{N}}$ of random vectors $\{\mu_{ij}(n) : \{i,j\} \in Q(k);\ k \in V \setminus R\}$ with $0 \le \mu_{ij}(n) \le 1$ and $\sum_{\{i,j\} \in Q(k)} \mu_{ij}(n) = 1$ for each $k \in V \setminus R$. We assume each $\mu(n)$ to be $\mathcal{F}_n(R)$-measurable, i.e., it is a function of the measured delays of the first $n$ probes. We will usually suppress the explicit dependence on the number of probes $n$. Let $\hat{s} = \{\hat{s}_{ij}(n) : \{i,j\} \in Q(k);\ k \in V \setminus R\}$ be a family of estimators, $\hat{s}_{ij}(n)$ being an $\mathcal{F}_n(\{i,j\})$-measurable unbiased estimator of $\mathrm{Var}(X_{i \vee j})$. Then we estimate $\mathrm{Var}(X_k)$ by

$$V_k(\mu, \hat{s}) = \sum_{\{i,j\} \in Q(k)} \mu_{ij}\hat{s}_{ij}. \qquad (8)$$

A covariance aggregator is called deterministic if it does not depend on the $X^{(i)}$. We denote the set of such aggregators with indices in $Q(k)$ by $\mathcal{D}_k$. An example is the uniform aggregator that was used in the uniform estimator (7): $\mu_{ij} = 1/\#Q(k)$. Define the covariance matrix

$$C_{(ij),(\ell m)} = \mathrm{Cov}(Z_i Z_j,\ Z_\ell Z_m), \qquad (9)$$

where $Z_i = X_i - E[X_i]$. We will use $C(k)$ to denote the matrix obtained by letting the indices $(ij)$ and $(\ell m)$ in (9) run over $Q(k)$; this is a submatrix of the matrix $C^0(k)$ obtained by taking the indices unrestricted over the set $Q^0(k)$ of binary subsets of $R(k)$.

A. Minimum Variance Estimation for Cumulative Delays

In the next theorem we characterize the asymptotic distribution of the $\hat{s}_{ij}$ as $n \to \infty$, and give a form for the estimator $V_k(\mu, \hat{s})$ of minimum cumulative variance.

Theorem 2: (i) For each $k \in V \setminus R$ the random variables $\{\sqrt{n}\,(\hat{s}_{ij} - s_k) \mid \{i,j\} \in Q(k)\}$ converge in distribution as $n \to \infty$ to a multivariate Gaussian random variable with mean 0 and covariance matrix $C(k)$. Hence the $\hat{s}_{ij}$ are consistent estimators of $s_k$, and so is $V_k(\mu, \hat{s})$. For any deterministic covariance aggregator $\mu$, $\sqrt{n}\,(V_k(\mu, \hat{s}) - s_k)$ converges in distribution as $n \to \infty$ to a Gaussian random variable of mean zero and variance $\mu \cdot C(k) \cdot \mu$.

(ii) The minimal asymptotic variance $\inf_{\mu \in \mathcal{D}_k} \mu \cdot C(k) \cdot \mu$ is achieved when

$$\mu_{ij} = \mu^*_{ij}(C(k)) := \frac{(C(k)^{-1} \cdot \mathbf{1})_{ij}}{\mathbf{1} \cdot C(k)^{-1} \cdot \mathbf{1}} \qquad (10)$$

where $C(k)^{-1}$ denotes the inverse matrix of $C(k)$ and $\mathbf{1}_{ij} = 1$, $\{i,j\} \in Q(k)$. The corresponding asymptotic variance of the variance estimator is $(\mathbf{1} \cdot C(k)^{-1} \cdot \mathbf{1})^{-1}$.

Proof: (i) The proof follows from standard results in multivariate analysis; convergence to the stated Gaussian random variable follows by Corollary 1.2.18 in [20].

(ii) Since the $\mu_{ij}$ sum to 1, the proof follows by considering the constrained minimization of $\mu \cdot C(k) \cdot \mu - 2\kappa\,\mu \cdot \mathbf{1}$ with Lagrange multiplier $\kappa$. As a covariance matrix, $C(k)$ is positive definite and hence invertible; minimization of the convex function of $\mu$ takes place at the stationary point $\mu = \kappa\,C(k)^{-1} \cdot \mathbf{1}$. This yields $\mu^*(C(k))$ upon normalization. The corresponding minimal asymptotic variance is $\mu^*(C(k)) \cdot C(k) \cdot \mu^*(C(k)) = (\mathbf{1} \cdot C(k)^{-1} \cdot \mathbf{1})^{-1}$. ■
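The weights in (10) can be computed by solving one linear system rather than forming the inverse explicitly. The following is a minimal sketch under stated assumptions: the covariance matrix is passed as a list of lists, and the helper names `solve`, `min_variance_weights` and `quad_form` are hypothetical, not taken from the paper.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def min_variance_weights(cov):
    """Return (mu*, minimal asymptotic variance) for covariance matrix cov."""
    y = solve(cov, [1.0] * len(cov))   # y = C^{-1} 1
    total = sum(y)                     # 1 . C^{-1} . 1
    return [v / total for v in y], 1.0 / total

def quad_form(w, cov):
    """w . C . w, the asymptotic variance for deterministic weights w."""
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
```

For example, with `cov = [[1, 0], [0, 4]]` the weights are 0.8 and 0.2 and the asymptotic variance is 0.8, compared with 1.25 for the uniform weights (0.5, 0.5): the noisier pairwise estimate is down-weighted.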


Operationally, the coefficients $\mu_{ij}$ of the minimum variance estimator $V_k(\mu^*(C(k)), \hat{s})$ of Theorem 2 are to be calculated from an estimate of the covariance matrix $C(k)$. Let $\hat{C}(k)$ denote the empirical covariance matrix with entries given by (11).

[Equation (11): the entries of $\hat{C}(k)$, expressed as sums over the $n$ probes of products of the centered delays $Z_i^{(m)}$; the expression is garbled in the source.]

$\hat{C}(k)$ is an unbiased estimator of $C(k)$. Estimating $\mu^*(C(k))$ by $\mu^*(\hat{C}(k))$ and $s_k$ by $V_k(\mu^*(\hat{C}), \hat{s})$ potentially introduces bias and increases variance in the estimation of the $s_k$. However, the following Theorem shows that it is consistent and has the same asymptotic variance as $V_k(\mu^*(C), \hat{s})$.

Theorem 3: $V_k(\mu^*(\hat{C}(k)), \hat{s})$ is a consistent estimator of $s_k$. $\sqrt{n}\,(V_k(\mu^*(\hat{C}(k)), \hat{s}) - s_k)$ converges in distribution to a Gaussian random variable of mean zero and variance $(\mathbf{1} \cdot C(k)^{-1} \cdot \mathbf{1})^{-1}$.

Proof: Clearly $\hat{C}(k)$ converges almost surely to $C(k)$ as $n \to \infty$. Since matrix inversion is continuous on the set of strictly positive definite matrices, $\mu^*(\hat{C}(k))$ converges almost surely to $\mu^*(C(k))$; since each $\hat{s}_{ij}$ converges to $s_{ij} = s_k$, $V_k(\mu^*(\hat{C}(k)), \hat{s})$ is consistent.

By the $\delta$-method (see e.g. [29]), $\sqrt{n}\,(V_k(\mu^*(\hat{C}(k)), \hat{s}) - s_k)$ converges to a Gaussian random variable with mean 0 and variance $a \cdot C^0(k) \cdot a$, where for $\{\ell, m\} \in Q^0(k)$,

$$a_{\ell m} = \frac{\partial}{\partial s_{\ell m}} \sum_{\{i,j\} \in Q(k)} s_{ij}\,\mu^*_{ij}(C(k)). \qquad (12)$$

Differentiating,

$$a_{\ell m} = \mu^*_{\ell m}(C(k))\,\chi_{Q(k)}(\{\ell, m\}) + \sum_{\{i,j\} \in Q(k)} s_{ij}\,\frac{\partial}{\partial s_{\ell m}}\mu^*_{ij}(C(k)), \qquad (13)$$

where $\chi_{Q(k)}$ denotes the indicator function of the set $Q(k)$. But $s_{ij} = s_{i \vee j} = s_k$ for $\{i,j\}$ in $Q(k)$ and so is constant in the sum. Since the $\mu^*_{ij}$ sum to 1, the sum in (13) is zero. Hence $a \cdot C^0(k) \cdot a = \mu^*(C(k)) \cdot C(k) \cdot \mu^*(C(k))$. ■


B. Minimum Variance Estimation for Link Delays

We can estimate the link delay variance as the difference of two cumulative variances since

$$\mathrm{Var}(X_k) = \mathrm{Var}(X_{f(k)} + D_k) = \mathrm{Var}(X_{f(k)}) + \mathrm{Var}(D_k), \qquad (14)$$

by the independence assumption on link delays. An unbiased estimator of $r_k := \mathrm{Var}(D_k)$ is $V_k(\mu^*(C(k)), \hat{s}) - V_{f(k)}(\mu^*(C(f(k))), \hat{s})$. We now show that joint optimization of the aggregators in $V_k$ and $V_{f(k)}$ will result in an estimator of lower variance.

Given a pair $\mu = (\mu(k), \mu(f(k))) \in \mathcal{D}_k \times \mathcal{D}_{f(k)}$ of deterministic covariance aggregators with indices in $Q(k)$ and $Q(f(k))$ respectively, we can form an unbiased estimate of $r_k$ as

$$W_k(\mu, \hat{s}) := V_k(\mu(k), \hat{s}) - V_{f(k)}(\mu(f(k)), \hat{s}). \qquad (15)$$

Let $C'(k)$ denote the $\#Q(k) + \#Q(f(k))$ dimensional matrix written in block form

$$C'(k) = \begin{pmatrix} C(k) & C(k, f(k)) \\ C(k, f(k))^{T} & C(f(k)) \end{pmatrix} \qquad (16)$$

where $C(k, f(k))$ is the $\#Q(k) \times \#Q(f(k))$ matrix of covariances $[C_{(ij),(\ell m)}]_{(ij) \in Q(k),\,(\ell m) \in Q(f(k))}$.

Statements analogous to Theorem 2(ii) follow straightforwardly, using parallel arguments. In particular $\sqrt{n}\,(W_k(\mu, \hat{s}) - r_k)$ converges to a Gaussian random variable of mean 0 and variance $\mu \cdot C'(k) \cdot \mu$, and the minimum over deterministic aggregators of the asymptotic variance takes the value $(c_1 + c_2 + 2c_3)/(c_1 c_2 - c_3^2)$ where $c_1 = \mathbf{1}_k \cdot C(k)^{-1} \cdot \mathbf{1}_k$, $c_2 = \mathbf{1}_{f(k)} \cdot C(f(k))^{-1} \cdot \mathbf{1}_{f(k)}$ and $c_3 = \mathbf{1}_{f(k)} \cdot C(k, f(k))^{-1} \cdot \mathbf{1}_k$. (Here the subscripts on $\mathbf{1}_k$, $\mathbf{1}_{f(k)}$ distinguish the subspaces in which these vectors live.)


C. Criteria for Assessing Inference Reliability

In Sections IV-A and IV-B we derived expressions for the variances of estimates of the cumulative and link delays respectively. For a given delay variance estimate, we can associate its variance by using the plug-in estimator for the corresponding analytic expression. This enables us to find confidence intervals for the estimates that will be asymptotically accurate for large $n$. For example, if we use $n$ probes to form the estimate $V_k(\mu^*(\hat{C}(k)), \hat{s})$, we associate with this a variance $\hat{\sigma}^2/n$ where $\hat{\sigma}^2 = (\mathbf{1} \cdot \hat{C}(k)^{-1} \cdot \mathbf{1})^{-1}$. We write confidence limits for the estimate as

$$V_k(\mu^*(\hat{C}(k)), \hat{s}) \pm z_{\delta/2}\,\hat{\sigma}/\sqrt{n}, \qquad (17)$$

where $z_{\delta/2}$ denotes the number that cuts off an area $\delta/2$ in the right tail of the standard normal distribution. This is used for a confidence interval of level $1 - \delta$.
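The confidence limits in (17) are straightforward to evaluate once the estimate and its plug-in asymptotic variance are available. The sketch below assumes both are already computed and passed in as numbers; the function name `confidence_interval` and its parameters are illustrative, not from the paper.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(estimate, sigma2, n, delta=0.05):
    """Two-sided interval of level 1 - delta: estimate +/- z_{delta/2} * sigma / sqrt(n).

    sigma2 is the plug-in asymptotic variance, e.g. (1 . C_hat^{-1} . 1)^{-1}.
    """
    z = NormalDist().inv_cdf(1 - delta / 2)  # right-tail cutoff z_{delta/2}
    half_width = z * sqrt(sigma2 / n)
    return estimate - half_width, estimate + half_width
```

For instance, an estimate of 1.0 with plug-in variance 0.8 from 1000 probes gives a 95% interval of roughly 1.0 ± 0.055.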


V. Impact of Loss on Estimator Variance

We relax the assumption of finite delays. Here we identify infinite delays with packet loss, although the same results would hold were we to treat as lost any packet with source to leaf delay greater than some finite value. The link and cumulative delay random variables will be denoted by $D'_k$ and $X'_k$ respectively, each possibly taking the value $\infty$. We use $D_k$ to be the distribution of $D'_k$ conditional on $D'_k < \infty$, and similarly for $X_k$. We assume throughout that the $D_k$ have finite fourth moments. Since we are interested in delay variance, we want to estimate $\mathrm{Var}(X_k)$ and $\mathrm{Var}(D_k)$ even in the presence of packet loss. For estimation, the effect of packet loss is to reduce the number of delay samples available, and hence to increase the variability of the estimates. A simple way to apply the foregoing theory is to restrict attention to only those packets that are received at every leaf (or at least at every element of $R(k)$ when estimating $s_k$). A disadvantage of this approach is that it does not scale well as the topology grows: assuming link loss rates to be bounded away from zero, the proportion of packets reaching all receivers in a tree decays geometrically fast in the number of links in the tree.

An alternative that wastes less data is to calculate pairwise estimates of $s_{ij}$ that use all packets received at $i$ and $j$. Let us formalize this. For a subset of receivers $S \subset R$ define $I_n(S) = \{i \in \{1, 2, \ldots, n\} \mid X_j^{(i)} < \infty\ \forall j \in S\}$: the subset of the first $n$ probes that are received at all nodes in $S$; set $N_n(S) = \#I_n(S)$. We will sometimes write $I_n(i_1, \ldots, i_r)$ for $I_n(\{i_1, \ldots, i_r\})$, and similarly for $N_n$. For $S \subset R$ let $V(S)$ be the set of nodes in the minimal tree spanning 0 and $S$. Set $B(S) = \prod_{k \in V(S)} \alpha_k$ where $\alpha_k$ is the probability of successful transmission over link $k$. Clearly $n^{-1}N_n(S)$ converges almost surely to $B(S)$ as $n \to \infty$. Estimator variance can be reduced by using all packets in $I_n(i,j)$ to estimate $s_{ij}$, not just those in $I_n(R(i \vee j))$. Define

$$\hat{v}_{ij} = \frac{1}{N-1}\left(\sum_{m} X_i^{(m)} X_j^{(m)} - \frac{1}{N}\sum_{m, m'} X_i^{(m)} X_j^{(m')}\right), \qquad (18)$$

where $N$ abbreviates $N_n(i,j)$ and in the sums $m, m'$ run over $I_n(i,j)$. $\hat{v}_{ij}$ is an unbiased estimate of $s_{ij}$. Analogous to the previous results we have

Theorem 4: (i) For each $k \in V \setminus R$ the random variables $\{\sqrt{n}\,(\hat{v}_{ij} - s_k) \mid \{i,j\} \in Q(k)\}$ converge in distribution as $n \to \infty$ to a multivariate Gaussian random variable with mean 0 and covariance matrix $G(k)_{(ij),(\ell m)} = C(k)_{(ij),(\ell m)}\,B(i,j,\ell,m)/(B(i,j)\,B(\ell,m))$. Hence the $\hat{v}_{ij}$ are consistent estimators of $s_k$, and so is $V_k(\mu, \hat{v})$ for any deterministic covariance aggregator $\mu$. For any deterministic covariance aggregator $\mu$, $\sqrt{n}\,(V_k(\mu, \hat{v}) - s_k)$ converges in distribution as $n \to \infty$ to a Gaussian random variable of mean zero and variance $\mu \cdot G(k) \cdot \mu$.

(ii) The minimal asymptotic variance $\inf_{\mu \in \mathcal{D}_k} \mu \cdot G(k) \cdot \mu$ is achieved when $\mu = \mu^*(G)$; the corresponding minimal asymptotic variance is $(\mathbf{1} \cdot G(k)^{-1} \cdot \mathbf{1})^{-1}$.

(iii) $V_k(\mu^*(\hat{G}), \hat{v})$ has the same asymptotic properties as $V_k(\mu^*(G), \hat{v})$ where the estimated covariance $\hat{G}$ is defined by (19).

[Equation (19): the entries $\hat{G}_{(ij),(k\ell)}$, expressed in terms of $N_n(i,j)$, $N_n(k,\ell)$, $N_n(i,j,k,\ell)$ and sums over probes $m$; the expression is garbled in the source.]

where the sums run over $I_n(i,j,k,\ell)$.

The corresponding version of the minimum variance link delay variance estimator follows by replacing $C$ by $G$ and $\hat{s}$ by $\hat{v}$ throughout Section IV-B.

Fig. 2. Trees Used in Simulations. Left: 8-leaf binary tree for model simulations; Right: Heterogeneous 7-leaf tree for ns simulation.
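The pairwise estimate of Section V, which uses every probe received at both leaves $i$ and $j$ rather than only those received everywhere, can be sketched as follows. The representation of a lost probe as `None` and the function name `pairwise_cov` are assumptions made for this illustration.

```python
def pairwise_cov(xi, xj):
    """Unbiased sample covariance over probes received at both leaves.

    xi, xj: per-probe end-to-end delays at leaves i and j, aligned by probe
    index, with None marking a lost probe (an assumed encoding of infinite delay).
    """
    both = [(a, b) for a, b in zip(xi, xj) if a is not None and b is not None]
    n = len(both)  # plays the role of N_n(i, j)
    if n < 2:
        raise ValueError("need at least two probes received at both leaves")
    sum_i = sum(a for a, _ in both)
    sum_j = sum(b for _, b in both)
    sum_ij = sum(a * b for a, b in both)
    return (sum_ij - sum_i * sum_j / n) / (n - 1)
```

Probes lost at only one of the two leaves are dropped for this pair but remain usable for other pairs, which is what makes the approach less wasteful than restricting to probes received at every leaf.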


VI.  Simulation  Evaluation 

We conducted two types of simulation: (i) model simulations with packet delay chosen pseudo-randomly according to a given distribution; and (ii) ns [22] simulations in which the probe traffic was mixed in with background traffic of TCP and UDP sessions, delay occurred as a result of queueing against background traffic, and loss was due to buffer overflow. The model simulations allow us to compare the theoretical prediction with a model in a controlled manner; their purpose is to show that the statistical properties of the estimators conform to the model used. The ns simulations allow us to investigate the performance of the inference method in a more realistic setting in which the model assumptions (such as independence) may not be exactly satisfied. Their purpose is to investigate conformance of the predicted delay variances with those occurring in the network interior.

A. Model Simulations

The model simulation used an 8-leaf binary tree (see Figure 2(left)); delays were exponentially distributed. The delay variances were heterogeneous: leaf links 8 and 15 had delay variance 10, all other links had delay variance 1. Losses were not modeled. This heterogeneity was chosen in order to evaluate the advantages of the minimum variance estimator. We present a representative set of results from experiments for the link delay variance $W$ and the cumulative delay variance $V$.

Fig. 3. Variance of Estimated Variance. Cumulative Delay Variance to nodes 1, 6, 10, 15 in Figure 2(left). Left: Calculated Variance; Right: Empirical Variance from 100 simulations.


Weight mu_ij    Link pairs (i,j)
0.000018        (8,15)
0.001213        (8,12) (8,13) (10,15) (11,15)
0.001811        (8,14) (9,15)
0.081286        (10,12) (10,13) (11,12) (11,13)
0.121322        (9,12) (9,13) (10,14) (11,14)
0.181077        (9,14)

TABLE I

Weights for Minimum Variance Estimator. Topology of Figure 2. Links 8 and 15 have ten times the variance of others.

A.1 Convergence

Figure 3 shows the variance of the estimated cumulative delay variance from the source to nodes $k = 1, 6, 10, 15$ in Figure 2(left), plotted as a function of the number of probes. On the left is the theoretical variance $\mathrm{Var}(V_k(\mu^*(C), \hat{s}))$; on the right the empirical variance from 100 samples of $V_k(\mu^*(\hat{C}), \hat{s})$ found by simulation. Observe in both cases the decay of the variance towards 0 as the number of probes increases; furthermore the experimental variance is very close to the theoretical values over the range of probe numbers.

Figure 4 shows detail from a single simulation: sample paths of the link variance estimator $W_k(\mu, \hat{s})$ for links $k = 1, 3, 5, 10$ as a function of the number of probes, for up to 10,000 probes. In the left figure the aggregator $\mu$ is uniform; in the right figure it is the minimum variance aggregator $\mu^*(C)$. Observe in both cases that the estimate approaches the model value, 1, as the number of probes increases.

A.2 Variance Reduction

In Figure 4, convergence is tighter for the minimum variance estimator (on the right) than in the uniform case; this is particularly apparent in the left region of each plot, corresponding to smaller numbers of probes. The difference is particularly evident for link 1 (which has 2 high variance links as descendants, 8 and 15) and link 3, which has link 15 as a descendant. The variance of the estimators $W_k$ for both these links is decreased in the minimum variance estimator, relative to the uniform estimator, by reducing the weight $\mu_{ij}$ when $i$ or $j$ corresponds to a high variance link. This is particularly striking in the minimum variance estimator for link 1; we tabulate the weights $\mu_{ij}$ in Table I. The weight for the pair (8, 15) of high variance links is $10^{-4}$ times the highest weight, that for pair (9, 14).

To see the statistics of estimator variance reduction, we display in Figure 5 the ratio of the standard deviation of the uniform estimator to the standard deviation of the minimum variance estimator, as a function of the number of probes. This is shown on the left for the cumulative variance, and on the right for the link delay variance. For the cumulative variance we display only links 1, 2 and 3; for the other internal links the uniform and minimum variance estimators are identical because there is only one term in the sum for $V$. The figures show that the reduction in variance is roughly uniform across a range of experiment lengths up to 10,000 probes. The standard deviation was roughly halved for the cumulative delay variance, and reduced by a factor of between 0.3 and 0.5 for the link delay variance. Reduction was somewhat greater for the standard deviation of the link delay variance, except for nodes 4 and 7. These nodes have only two descendants, one of which terminates a high variance link; there is no flexibility to avoid the high variance of the first term of $W_k = V_k - V_{f(k)}$.

B. Network Simulation

B.1 Methodology

The ns simulations used the topology in Figure 2(right). We arranged for some heterogeneity between the edges and the center of the tree in order to mimic the difference between the core and edges of a large WAN, with the interior of the tree having higher capacity (5 Mb/sec) and latency (50 ms) than at the edge (1 Mb/sec and 10 ms). Each node had a finite buffer capacity; packet losses were due to drops from the tail of the buffer. We used buffer capacities of 4 and 20 packets in two different sets of experiments. The cross traffic comprised 66 FTP sessions over TCP, and 29 UDP traffic sources following an exponential on-off model; there were on average around 8 background traffic sources per link. In each simulation we use the source-to-leaf delays of probes as data to infer delay variance per internal link, and also from the source to a given internal node. Since the simulations exhibit packet loss, the inference was performed using the algorithms described in Section V. We compared the inferred values $W_k$ with the actual delay variance for probes on internal links that was observed during the simulation run. The comparison was performed over each link in Figure 2(right) for 100 simulation runs.

Fig. 4. Sample Paths of Link Variance Estimators. For up to 10,000 probes. Left: Uniform Estimator; Right: Minimum Variance Estimator.

Fig. 5. Variance Reduction. Ratio of standard deviation of the uniform estimator to standard deviation of the minimum variance estimator. Left: cumulative delay variance; Right: link delay variance.


B.2 Comparison of Inferred and Actual Delay Variance

Figure 6 shows scatter plots of 1200 pairs of (inferred, actual) link delay variance, based on 1000 probes: on the left with buffer capacity of 4 packets, on the right with buffer capacity of 20 packets. Also shown is the line through the origin at gradient 1; a point on this line would indicate an instance of perfect inference. In the scatter plots we differentiate between predictions using the uniform estimator, and those using the minimum variance estimator.

Fig. 6. Scatter plots for link delay inference. Pairs of (actual, inferred) over 12 links in 100 experiments, each using 1000 probes. Left: nodes have 4 packet buffers; Right: nodes have 20 packet buffers.

Taking each plot separately we observe that inference is more accurate for the minimum variance estimator than the uniform estimator, the difference being more evident for the smaller buffer size. Comparing the plots we see that inference is more accurate for the simulated network with larger buffer capacities, particularly for small delay variances. A small number of inferred values were negative. This occurred for some links of high bandwidth for which queueing delays were small. Estimation of the link delay variance as the difference between the variances of the cumulative delays (see (15)) is sensitive to estimation errors. Nevertheless, the estimation error is sufficiently small that it would not impair identification of those links with the largest delay variance. Furthermore, in practice we can avoid the worst small variance estimation errors by eliminating estimates that are not significantly different from zero at some confidence level. Similar to (17), these are the estimates $W_k$ from $n$ probes for which $W_k < z_\delta\,\hat{\sigma}/\sqrt{n}$, where $1 - \delta$ is the desired (one-sided) confidence level, and $\hat{\sigma}^2$ is the appropriate asymptotic variance expressed in terms of the estimated covariance.
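The screening rule just described can be sketched as a simple filter. The data layout (a dictionary mapping each link to its estimate and plug-in asymptotic variance) and the name `significant_estimates` are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist

def significant_estimates(estimates, n, delta=0.05):
    """Keep link estimates W_k that are at least z_delta * sigma / sqrt(n).

    estimates: {link: (w_hat, sigma2_hat)}, where sigma2_hat is the plug-in
    asymptotic variance for that link; n is the number of probes.
    """
    z = NormalDist().inv_cdf(1 - delta)  # one-sided cutoff z_delta
    return {link: w for link, (w, s2) in estimates.items()
            if w >= z * sqrt(s2 / n)}
```

Negative estimates are always rejected by this rule, and setting the cutoff to zero standard deviations would reject only those.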

We attribute bias of inference to departures of the delay process from the independence assumption of the model. We calculated the off-diagonal elements of the correlation matrix of the actual link delays. For buffer size 4 the mean value was 0.071, the maximum 0.51. For buffer size 20 the mean was 0.021, the maximum 0.17. Thus correlations were more pronounced for the smaller buffer size, leading to greater inference inaccuracy. We found that bias was more pronounced in the inference of cumulative delay, particularly for buffer size 20 where the cumulative delay variance is almost always overestimated. Bias was less evident for the link delay variance. Since this is expressed as a difference of estimated cumulative delay variances, consistent bias in the latter quantities should cancel somewhat in subtraction. Conversely, small delay variances are better estimated for the cumulative than the link case.

In order to quantify the accuracy of inference we define a metric for evaluating estimator accuracy. If $w$ and $\hat{w}$ are the actual and inferred delay variances (either cumulative to a link or at the link itself) we form their error factor

$$F(w, \hat{w}) = \max\left\{\frac{w}{\hat{w}}, \frac{\hat{w}}{w}\right\}. \qquad (20)$$

For example, if $\hat{w}$ is either twice or half $w$, their error factor is 2. As a robust summary statistic to capture the center of the distribution of error factors, we use the two-sided quartile-weighted median (QWM)

$$(Q_{.25} + 2Q_{.5} + Q_{.75})/4, \qquad (21)$$

where $Q_p$ denotes the $p$-th quantile of a given set of error factors.

           buffer = 4 pkts.       buffer = 20 pkts.
z          unif.    min. var.     unif.    min. var.
0          2.31     2.06          1.32     1.32
2          1.56     1.76          1.23     1.23

TABLE II

Quartile Weighted Median Error Factors for Inference on 1000 Packets. Link delay variance estimation, according to number of standard deviations z in confidence level to avoid small variances. Errors are smaller for minimum variance estimator than uniform estimator, and also with increased buffer capacity.
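The error factor (20) and the quartile-weighted median (21) can be sketched in a few lines. The quantile convention used here (linear interpolation between order statistics) is an assumption; the paper does not say which convention was used, and the function names are illustrative.

```python
def error_factor(actual, inferred):
    """F(w, w_hat) = max(w / w_hat, w_hat / w); both arguments must be positive."""
    return max(actual / inferred, inferred / actual)

def quantile(sorted_vals, p):
    """p-th quantile by linear interpolation (one of several common conventions)."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def qwm(values):
    """Quartile-weighted median (Q.25 + 2 Q.5 + Q.75) / 4."""
    v = sorted(values)
    return (quantile(v, 0.25) + 2 * quantile(v, 0.5) + quantile(v, 0.75)) / 4
```

Because the error factor is undefined for non-positive inferred variances, such values have to be screened out before it is computed, which is consistent with the omission of small or negative inferred variances in Table II.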

In Table II we display the QWM of error factors for link variance estimation. Small or negative inferred variances were omitted, the quantity $z$ being the number of standard deviations characterizing the confidence interval about 0. $z = 0$ corresponds to rejecting only negative inferred variances. Ruling out these small variances decreases the QWM of the error factor: the smaller variances typically have higher error factor. (For $z = 2$, buffer = 4, it happens that the 75th percentile of the error factor distribution is larger for the minimum variance estimator, but this is atypical.) For large buffer sizes the error factors are noticeably smaller; the difference in accuracy between the uniform and minimum variance estimator is smaller too. We found no great advantage in increasing the number of probes to 10,000 since bias becomes a larger part of the errors.

VII. Conclusions and Further Work

In this paper we have proposed a novel technique for the inference from end-to-end measurements of the variance of the delay encountered by multicast packets on an internal link. The cooperation of intervening network nodes is not required.

We constructed a convex family of variance estimators and found the estimator of minimal asymptotic variance. Evaluating the minimal variance estimator comes at some computational cost, namely, the inversion of the covariance matrix $C$. In work to be reported elsewhere, we show how this computation may be considerably simplified for binary trees, although at the cost of increasing estimator variance somewhat. Another approach is to compromise between the computational simplicity of the uniform estimator and variance reduction. An example would be to set $\mu_{ij} = 0$ for $\{i,j\}$ in some subset of $Q(k)$ in which the measured end-to-end variances $\hat{s}_{ij}$ are high. It remains to develop a robust approach along these lines.

The ns experiments showed typical errors of about 20% in estimation of the delay variance using 1,000 probes. We observe that using a 40-byte probe every 100 ms, the load on the network is less than 4 kb/sec and the measurements can be completed within 2 minutes.
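The probing overhead quoted above follows from simple arithmetic, sketched here. The function name `probe_load` is illustrative, and the calculation assumes the 40 bytes is the full on-the-wire probe size.

```python
def probe_load(probe_bytes, interval_s, n_probes):
    """Return (offered load in bits per second, experiment duration in seconds)."""
    bits_per_second = probe_bytes * 8 / interval_s
    duration_s = n_probes * interval_s
    return bits_per_second, duration_s
```

With 40-byte probes sent every 0.1 s, `probe_load(40, 0.1, 1000)` gives about 3200 bits per second and about 100 seconds, consistent with the stated bounds of 4 kb/sec and 2 minutes.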

We found inference to be more accurate in networks with larger buffers; there was smaller correlation between delays at different nodes and hence closer conformance to the underlying model. It appears that the larger buffers admit a greater diversity of connections through a node over queueing timescales, diluting the correlation seen between delays at successive nodes. We believe that diversity of traffic in real networks such as the Internet makes large and long lasting correlations unlikely. Furthermore the introduction of Random Early Detection (RED) [11] policies in Internet routers may help reduce dependence; evidence for this comes from related work on internal link loss inference [4], where the introduction of RED was found to increase accuracy of inference relative to networks with a Drop from Tail packet discard mechanism.

Multicast Topology Inference from Measured End-to-End Loss

N.G. Duffield, J. Horowitz, F. Lo Presti, D. Towsley


Abstract: The use of multicast inference on end-to-end measurement has recently been proposed as a means to infer network internal characteristics such as packet link loss rate and delay. In this paper we propose three types of algorithm that use loss measurements to infer the underlying multicast topology: (i) a grouping estimator that exploits the monotonicity of loss rates with increasing path length; (ii) a maximum likelihood estimator; and (iii) a Bayesian estimator. We establish their consistency, compare their complexity and accuracy, and analyze the modes of failure and their asymptotic probabilities.

Keywords: Communication Networks, End-to-End Measurement, Maximum Likelihood Estimation, Multicast, Statistical Inference, Topology Discovery.

I.  Introduction 

A.  Motivation. 

In  this  paper  we  propose  and  evaluate  a  number  of  al¬ 
gorithms  for  the  inference  of  logical  multicast  topologies 
from  end-to-end  network  measurements.  All  are  devel¬ 
oped  from  recent  work  that  shows  how  to  infer  per  link 
loss  rate  from  measured  end-to-end  loss  of  multicast  traf¬ 
fic.  The  idea  behind  this  approach  is  that  performance 
characteristics  across  a  number  of  intersecting  network 
paths  can  be  combined  to  reveal  characteristics  of  the  in¬ 
tersection  of  the  paths.  In  this  way,  one  can  infer  charac¬ 
teristics  across  a  portion  of  the  path  without  requiring  the 
portion’s  endpoints  to  terminate  measurements. 

The  use  of  active  multicast  probes  to  perform  mea¬ 
surements  is  particularly  well  suited  to  this  approach  due 
to  the  inherent  correlations  in  packet  loss  seen  at  differ¬ 
ent  receivers.  Consider  a  multicast  routing  tree  connect¬ 
ing  the  probe  source  to  a  number  of  receivers.  When  a 
probe  packet  is  dispatched  down  the  tree  from  the  source, 
a  copy  is  sent  down  each  descendant  link  from  every 
branch  point  encountered.  By  this  action,  one  packet  at 
the  source  gives  rise  to  a  copy  of  the  packet  at  each  re¬ 
ceiver.  Thus  a  packet  reaching  each  member  of  a  subset 
of  receivers  encounters  identical  conditions  between  the 
source  and  the  receivers’  closest  common  branch  point  in 
the  tree. 

N.G. Duffield and F. Lo Presti are with AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932, USA, E-mail: {duffield, lopresti}@research.att.com. J. Horowitz is with the Dept. of Math. & Statistics, University of Massachusetts, Amherst, MA 01003, USA, E-mail: joeh@math.umass.edu. D. Towsley is with the Dept. of Computer Science, University of Massachusetts, Amherst, MA 01003, USA, E-mail: towsley@cs.umass.edu


This  approach  has  been  used  to  infer  the  per  link  packet 
loss  probabilities  for  logical  multicast  trees  with  a  known 
topology.  The  Maximum  Likelihood  Estimator  (MLE) 
for  the  link  probabilities  was  determined  in  [3]  under  the 
assumption  that  probe  loss  occurs  independently  across 
links  and  between  probes.  This  estimate  is  somewhat  ro¬ 
bust  with  respect  to  violations  of  this  assumption.  This 
approach  will  be  discussed  in  more  detail  presently. 

The  focus  of  the  current  paper  is  the  extension  of  these 
methods  to  infer  the  logical  topology  when  it  is  not  known 
in  advance.  This  is  motivated  in  part  by  ongoing  work  [1] 
to  incorporate  the  loss-based  MLE  into  the  National  In¬ 
ternet  Measurement  Infrastructure  [14].  In  this  case,  in¬ 
ference  is  performed  on  end-to-end  measurements  arising 
from  the  exchange  of  multicast  probes  between  a  num¬ 
ber  of  measurement  hosts  stationed  in  the  Internet.  The 
methods  here  can  be  used  to  infer  first  the  logical  multi¬ 
cast  topology,  and  then  the  loss  rates  on  the  links  in  this 
topology. What we do not (and cannot) provide is an algorithm for identifying the physical topology of a network.


A  more  important  motivation  for  this  work  is  that 
knowledge  of  the  multicast  topology  can  be  used  by  mul¬ 
ticast  applications.  It  has  been  shown  in  [9]  that  orga¬ 
nizing  a  set  of  receivers  in  a  bulk  transfer  application 
into  a  tree  can  substantially  improve  performance.  Such 
an organization is a central component of the widely used
RMTP-II  protocol  [20].  The  development  of  tree  con¬ 
struction  algorithms  for  the  purpose  of  supporting  reliable 
multicast  has  been  identified  to  be  of  fundamental  impor¬ 
tance  by  the  Reliable  Multicast  Transport  Group  of  the 
IETF; see [7]. This motivated the work reported in [16],
which  was  concerned  with  grouping  multicast  receivers 
that  share  the  same  set  of  network  bottlenecks  from  the 
source  for  the  purposes  of  congestion  control.  Closely  re¬ 
lated  to  [3],  the  approach  of  [16]  is  based  on  estimating 
packet  loss  rates  for  the  path  between  the  source  and  the 
common  ancestor  of  pairs  of  nodes  in  the  special  case  of 
binary  trees.  Since  loss  is  a  non-decreasing  function  of  the 
path  length,  this  quantity  should  be  maximal  for  a  sibling 
pair.  The  whole  binary  tree  is  reconstructed  by  iterating 
this  procedure. 




B.  Contribution. 

This  paper  describes  and  evaluates  three  methods  for 
inference  of  logical  multicast  topology  from  end-to-end 
multicast  measurements.  Two  of  these  ((i)  and  (ii)  below) 
are  directly  based  on  the  MLE  for  link  loss  probabilities 
of  [3],  as  recounted  in  Section  II.  In  more  detail,  the  three 
methods  are: 

(i)  Grouping  Classifiers.  We  extend  the  grouping  method 
of  [16]  to  general  trees,  and  establish  its  correctness.  This 
is  done  in  two  steps.  First,  in  Section  III,  we  apply  and 
extend the methods of [3] to establish a one-to-one correspondence between the expected distribution of events
measurable  at  the  leaves,  and  the  underlying  topology 
and  loss  rates.  In  particular,  we  provide  an  algorithm 
that  reconstructs  arbitrary  (e.g.  non-binary)  topologies 
from  the  corresponding  distributions  of  leaf-measurable 
events.  Second,  in  Section  IV,  we  adapt  the  algorithm 
to  work  with  the  empirical  leaf-event  distributions  arising 
from  multicast  end-to-end  measurements.  A  complica¬ 
tion  arises  through  the  fact  that  certain  equalities  that  hold 
for  the  expected  distributions  only  hold  approximately 
for  the  measured  distributions.  We  propose  and  evalu¬ 
ate  three  variants  of  the  algorithm  to  overcome  this.  One 
is  based  on  the  above  reconstruction  method  for  general 
trees;  the  other  two  methods  use  binary  grouping  opera¬ 
tions  to  reconstruct  a  binary  tree,  which  is  then  manipu¬ 
lated  to  yield  the  inferred  tree. 

(ii)  Maximum  Likelihood  Classifier.  Given  the  measured 
end-to-end  packet  losses,  the  link  loss  estimator  of  [3]  as¬ 
sociates  a  likelihood  with  each  possible  logical  multicast 
tree  connecting  the  source  to  the  receivers.  The  maxi¬ 
mum  likelihood  classifier  selects  that  tree  for  which  the 
likelihood  is  maximal.  This  estimator  is  presented  in  Sec¬ 
tion  V. 

(iii) Bayesian Classifier. In this approach, the topology
and  link  probabilities  are  treated  as  random  variables  with 
some  prior  distribution.  In  Bayesian  decision  theory  one 
specifies  a  loss  function  that  characterizes  a  penalty  for 
misclassification,  then  selects  the  topology  that  minimizes 
the  mean  value  of  this  penalty  according  to  the  posterior 
distribution  (i.e.  the  conditional  distribution  of  the  pa¬ 
rameters  given  the  measurements).  This  estimator  is  pre¬ 
sented  in  Section  VI. 

In  all  cases  we  establish  that  the  classifiers  are  con¬ 
sistent,  i.e.,  the  probability  of  correct  classification  con¬ 
verges  to  1  as  the  number  of  probes  grows  to  infinity.  We 
establish  connections  amongst  the  grouping-based  algo¬ 
rithms.  In  particular,  the  general  grouping-based  algo¬ 
rithm  is  equivalent  to  the  composition  of  the  binary  group¬ 
ing  algorithm  with  a  pruning  operation  that  excises  links 
of  zero  loss  and  identifies  their  endpoints.  The  latter  ap¬ 
proach  turns  out  to  be  computationally  simpler. 

The  ML  and  Bayesian  classifiers,  embodying  standard 
statistical methods, provide reference points for the accuracy of the grouping-based classifiers. In Section VII
we  use  simulations  to  evaluate  the  relative  accuracy  of 
the  topology  classifiers,  and  to  understand  their  modes  of 
failure.  We  find  that  the  accuracy  of  the  grouping  clas¬ 
sifiers  either  closely  matches  or  exceeds  that  of  the  other 
methods  when  applied  to  the  identification  of  a  selection 
of  fixed  unknown  topologies.  This  finding  is  supported 
by  some  numerical  results  on  the  tail  asymptotics  of  mis¬ 
classification  probabilities  when  using  large  numbers  of 
probes.  The  simulations  show  the  techniques  can  resolve 
topologies  even  when  link  loss  probabilities  are  as  small 
as  about  1%,  on  the  basis  of  data  from  a  few  thousand 
probes.  This  data  could  be  gathered  from  a  probe  source 
of  low  bandwidth  (a  few  tens  of  kbits  per  second)  over  a 
few  minutes. 

The  ML  and  Bayesian  classifiers  are  considerably  more 
computationally  complex  than  the  grouping-based  meth¬ 
ods.  This  is  for  two  reasons:  (i)  they  exhaustively  search 
the  set  of  possible  trees,  whereas  the  grouping  approaches 
progressively  exclude  certain  topologies  from  considera¬ 
tion  as  groups  are  formed;  (ii)  their  per-topology  com¬ 
putational  costs  are  greater.  Since  the  number  of  pos¬ 
sible  topologies  grows  rapidly  with  the  number  of  re¬ 
ceivers,  any  decrease  in  per-topology  cost  for  the  ML  and 
Bayesian  classifiers  would  eventually  be  swamped  by  the 
growth  in  the  number  of  possible  topologies.  For  this 
reason, we expect that a significant decrease in complexity will
only  be  available  for  classifiers  that  are  able  to  search  the 
topology  space  in  a  relatively  sophisticated  manner,  e.g. 
as  performed  by  the  grouping-based  algorithms.  Sum¬ 
marizing,  we  conclude  that  binary-based  grouping  algo¬ 
rithms  provide  the  best  combination  of  accuracy  and  com¬ 
putational  simplicity. 

In  Section  VIII  we  further  analyze  the  modes  of  mis¬ 
classification  in  grouping  algorithms.  We  distinguish  the 
coarser  notion  of  misgrouping,  which  entails  failure  to 
identify  the  descendant  leaves  of  a  given  node.  This  no¬ 
tion  is  relevant,  for  example,  in  multicast  congestion  con¬ 
trol,  where  one  is  interested  in  establishing  the  set  of  re¬ 
ceivers  that  are  behind  each  bottleneck.  We  obtain  rates 
of  convergence  of  the  probability  of  successful  grouping 
and  classification  in  the  regime  of  small  link  loss  rates. 

We  conclude  in  Section  IX;  the  proofs  and  some  more 
detailed  technical  material  are  deferred  to  Section  X. 

C.  Other  Related  Work. 

The mtrace [12] measurement tool reports the route
from  a  multicast  source  to  a  receiver,  along  with  other 
information  about  that  path  such  as  per-hop  loss  statis¬ 
tics.  The  tracer  tool  [10]  uses  mtrace  to  perform 
topology  discovery.  We  briefly  contrast  some  properties 
of those methods with those presented here. (i) Access:
mtrace relies on routers to respond to explicit measurement queries; access to such facilities may be restricted
by service providers. The present method does not require such cooperation. (ii) Scaling: mtrace needs to
run  once  per  receiver  in  order  to  cover  the  tree,  so  that 
each  router  must  process  requests  from  all  its  descendant 
leaf  nodes.  The  present  method  works  with  a  single  pass 
down  the  tree.  On  the  other  hand,  our  methods  do  not  as¬ 
sociate  physical  network  addresses  with  nodes  of  the  logi¬ 
cal  multicast  tree.  For  this  reason,  we  envisage  combining 
mtrace  and  multicast-based  estimation  in  measurement 
infrastructures, complementing infrequent mtrace measurements with ongoing multicast-based inference to detect topology changes.

In  the  broader  context  of  network  tomography  we  men¬ 
tion  some  recent  analytic  work  on  a  different  problem, 
namely,  determination  of  source-destination  traffic  matrix 
from  source-  and  destination-averaged  traffic  volumes; 
see  [18],  [19]  for  further  details. 

II.  Loss  Trees  and  Inference  of  Loss  Rate 

We  begin  by  reviewing  the  tree  and  loss  models  used  to 
formulate  the  MLE  for  link  loss  probabilities  in  a  known 
topology.  We  identify  the  physical  multicast  tree  as  com¬ 
prising  actual  network  elements  (the  nodes)  and  the  com¬ 
munication  links  that  join  them.  The  logical  multicast  tree 
comprises  the  branch  points  of  the  physical  tree,  and  the 
logical  links  between  them.  The  logical  links  comprise 
one  or  more  physical  links.  Thus  each  node  in  the  logical 
tree  has  at  least  two  children,  except  the  leaf  nodes  (which 
have  none)  and  the  root  (which  we  assume  has  one).  We 
can  construct  the  logical  tree  from  the  physical  tree  by  the 
following  procedure:  except  for  the  root,  delete  each  node 
that  has  only  one  child,  and  adjust  the  link  set  accordingly 
by  linking  its  parent  directly  to  its  child. 
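The reduction from physical to logical tree described above can be sketched in a few lines. This is only an illustration: the child-to-parent dictionary representation and the function name `logical_tree` are assumptions made here, not part of the original text.

```python
def logical_tree(parent, root=0):
    """Collapse a physical tree (child -> parent map) into its logical tree.

    Every non-root node with exactly one child is deleted and its child is
    linked directly to the nearest surviving ancestor.
    """
    # Build the children lists of the physical tree.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    logical = {}
    for node in parent:
        # Single-child nodes are relays: they do not appear in the logical tree.
        if len(children.get(node, [])) == 1:
            continue
        # Climb past deleted (single-child) ancestors; the root is never deleted.
        anc = parent[node]
        while anc != root and len(children.get(anc, [])) == 1:
            anc = parent[anc]
        logical[node] = anc
    return logical


# Hypothetical physical tree: 0 -> 1 -> 2 -> {3, 4}, 4 -> 5 -> {6, 7}
physical = {1: 0, 2: 1, 3: 2, 4: 2, 5: 4, 6: 5, 7: 5}
print(logical_tree(physical))
```

In the resulting logical tree the root has the single child 2, and every other non-leaf node has at least two children, as required.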

A.  Tree  Model. 

Let $T = (V, L)$ denote a logical multicast tree with nodes $V$ and links $L$. We identify one node, the root 0, with the source of probes, and the set of leaves $R \subset V$ with the set of receivers. We say that a link is internal if neither of its endpoints is the root or a leaf node. We will occasionally use $W$ to denote $V \setminus (\{0, 1\} \cup R)$, where 1 denotes the child node of 0, the set of nodes terminating internal links. Each node $k$, apart from the root, has a parent $f(k)$ such that $(f(k), k) \in L$. We will sometimes refer to $(f(k), k)$ as link $k$. Define recursively the compositions $f^n = f \circ f^{n-1}$ with $f^1 = f$. We say $j$ is descended from $k$, and write $j \prec k$, if $k = f^n(j)$ for some positive integer $n$. The set of children of $k$, namely $\{j \in V : f(j) = k\}$, is denoted by $d(k)$. The (nearest) ancestor $a(U)$ of a subset $U \subset V$ is the $\prec$-least upper bound of all the elements of $U$. A collection of nodes $U$ are said to be siblings if they have the same parent, i.e., if $f(k) = a(U)\ \forall k \in U$. A maximal sibling set comprises the entire set $d(k)$ of children of some node $k \in V$. $T(k) = (V(k), L(k))$ will denote the subtree rooted at $k$; $R(k) = R \cap V(k)$ is the set of receivers in $T(k)$.

B.  Loss  Model. 

For each link we assume an independent Bernoulli loss model: each probe is successfully transmitted across link $k$ with probability $\alpha_k$. Thus the progress of each probe down the tree is described by an independent copy of a stochastic process $X = (X_k)_{k \in V}$ as follows. $X_0 = 1$. $X_k = 1$ if the probe reaches node $k \in V$ and 0 otherwise. If $X_k = 0$, then $X_j = 0,\ \forall j \in d(k)$. Otherwise, $P[X_j = 1 \mid X_k = 1] = \alpha_j$ and $P[X_j = 0 \mid X_k = 1] = 1 - \alpha_j$. We adopt the convention $\alpha_0 = 1$ and denote $\alpha = (\alpha_i)_{i \in V}$. We call the pair $(T, \alpha)$ a loss tree. $P_{T,\alpha}$ will denote the distribution of $X$ on the loss tree $(T, \alpha)$. In what follows we shall work exclusively with canonical loss trees. A loss tree is said to be in canonical form if $0 < \alpha_k < 1,\ \forall k \in V$ except for $k = 0$. Any tree $(T, \alpha)$ not in canonical form can be reduced to a loss tree, $(T', \alpha')$, in canonical form such that the distribution of $(X_k)_{k \in R}$ is the same under the corresponding probabilities $P_{T,\alpha}$ and $P_{T',\alpha'}$. To achieve this, links $k$ with $\alpha_k = 1$ are excised and their endpoints identified. If any link $k$ has $\alpha_k = 0$, then $X_j = 0$ for all $j \prec k$, and hence no probes are received at any receiver in $R(k)$. By removal of subtrees $T(k)$ rooted at such $k$, we obtain a tree in which all probabilities $\alpha_k > 0$. Henceforth we shall consider only canonical loss trees.
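The Bernoulli loss process is straightforward to simulate, which is useful for generating synthetic receiver traces. The sketch below is a minimal illustration; the dictionary-based tree representation and the names `simulate_probe` and `receiver_traces` are assumptions introduced here.

```python
import random


def simulate_probe(children, alpha, rng, root=0):
    """Simulate one probe; return {node: X_node} with X_node in {0, 1}."""
    x = {root: 1}  # X_0 = 1: the source always holds the probe
    stack = [root]
    while stack:
        k = stack.pop()
        for j in children.get(k, []):
            # X_j = 1 with probability alpha[j] if X_k = 1, and X_j = 0 otherwise.
            x[j] = 1 if x[k] == 1 and rng.random() < alpha[j] else 0
            stack.append(j)
    return x


def receiver_traces(children, alpha, n, seed=0, root=0):
    """Return {receiver: [outcome for probe 1, ..., outcome for probe n]}."""
    rng = random.Random(seed)
    nodes = {j for kids in children.values() for j in kids}
    receivers = sorted(j for j in nodes if not children.get(j))
    traces = {r: [] for r in receivers}
    for _ in range(n):
        x = simulate_probe(children, alpha, rng, root)
        for r in receivers:
            traces[r].append(x[r])
    return traces


# Hypothetical loss tree: 0 -> 1 -> {2, 3}, 3 -> {4, 5}; receivers are 2, 4, 5.
children = {0: [1], 1: [2, 3], 3: [4, 5]}
alpha = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5}
traces = receiver_traces(children, alpha, n=1000)
```

Only the receiver outcomes are returned, mirroring the fact that the process $X$ is observable only at the leaves.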

C.  Inference  of  Loss  Rates. 

When a probe is sent down the tree from the root 0, we cannot observe the whole process $X$, but only the outcome $(X_k)_{k \in R} \in \Omega = \{0, 1\}^R$ that indicates whether or not the probe reached each receiver. In [3] it was shown how the link probabilities can be determined from the distribution of outcomes when the topology is known. Set

$$\gamma(k) = P_{T,\alpha}\Big[\bigvee_{j \in R(k)} X_j = 1\Big]. \qquad (1)$$

The internal link probabilities $\alpha$ can be found from $\gamma = \{\gamma(k) : k \in V\}$ as follows. For $k \in V$ let $A(k)$ be the probability that the probe reaches $k$. Thus $A(k) = \prod_{j \succeq k} \alpha_j$, the product of the probabilities of successful transmission on each link between $k$ and the root 0. For $U \subset V$ we write $\gamma(U) = P\big[\bigvee_{k \in U} \bigvee_{j \in R(k)} X_j = 1\big]$. A short probabilistic argument shows that for any $U \subset d(k)$,

$$\big(1 - \gamma(U)/A(k)\big) = \prod_{j \in U} \big(1 - \gamma(j)/A(k)\big). \qquad (2)$$

In particular, this holds for $U = d(k)$, in which case $\gamma(U) = \gamma(k)$. It can be shown for canonical loss trees that $A(k)$ is the unique solution of (2); see Lemma 1 in [3] or Prop. 1 below. Thus given $\{\gamma(k) : k \in V\}$ one can find $(A(k))_{k \in V}$, and hence $\alpha$, by taking appropriate quotients.
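As a concrete check of this procedure, the following sketch computes the exact $\gamma(k)$ for a small loss tree and then recovers the link probabilities by solving (2) with $U = d(k)$. It is a minimal illustration under stated assumptions: the function names, the dictionary representation, and the use of bisection on the interval $(\gamma(k), 1]$ are choices made here, not taken from [3].

```python
from math import prod


def reach_and_gamma(children, alpha, root=0):
    """Return (A, gamma): reach probabilities A(k) and leaf-event probabilities gamma(k)."""
    A, gamma = {root: 1.0}, {}

    def visit(k):
        # g = P[some receiver below k gets the probe | probe reached k]
        kids = children.get(k, [])
        if not kids:
            g = 1.0
        else:
            miss = 1.0
            for j in kids:
                A[j] = A[k] * alpha[j]
                miss *= 1.0 - alpha[j] * visit(j)
            g = 1.0 - miss
        gamma[k] = A[k] * g
        return g

    visit(root)
    return A, gamma


def solve_A(gamma_k, child_gammas, iters=200):
    """Solve (1 - gamma_k/A) = prod_j (1 - gamma_j/A) for A in (gamma_k, 1] by bisection."""
    def h(a):
        return (1.0 - gamma_k / a) - prod(1.0 - g / a for g in child_gammas)

    lo, hi = gamma_k, 1.0  # h < 0 below the solution and h > 0 above it
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def recover_alpha(children, gamma, root=0):
    """Recover link probabilities alpha_k = A(k) / A(f(k)) from the gamma(k)."""
    parent = {j: k for k, kids in children.items() for j in kids}
    A = {root: 1.0}
    for k in parent:
        kids = children.get(k, [])
        # For a leaf, gamma(k) = A(k); otherwise solve equation (2) with U = d(k).
        A[k] = gamma[k] if not kids else solve_A(gamma[k], [gamma[j] for j in kids])
    return {k: A[k] / A[parent[k]] for k in parent}


# Hypothetical loss tree: 0 -> 1 -> {2, 3}, 3 -> {4, 5}
children = {0: [1], 1: [2, 3], 3: [4, 5]}
alpha = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5}
_, gamma = reach_and_gamma(children, alpha)
print(recover_alpha(children, gamma))
```

The bisection relies on the sign property stated in Prop. 1(ii): the left side of (2) minus the right side is negative below the solution and positive above it.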

Let $x = (x^{(1)}, \ldots, x^{(n)})$ with $x^{(i)} = (x_k^{(i)})_{k \in R}$ be the set of outcomes arising from the dispatch of $n$ probes from the source. We denote the log-likelihood function of this event by

$$\mathcal{L}(T, \alpha) = \log P_{T,\alpha}[x] \qquad (3)$$

Construct the empirical distributions $\hat\gamma(k) = n^{-1} \sum_{i=1}^{n} \bigvee_{j \in R(k)} x_j^{(i)}$, i.e. the fraction of the $n$ probes that reaches some receiver descended from $k$. Let $\hat A$ denote the corresponding solution of (2) obtained by using $\hat\gamma$ in place of $\gamma$, and $\hat\alpha$ the corresponding probabilities obtained by taking quotients of the $\hat A$. The following result, the proof of which can be found in [3], holds.

Theorem 1: Let $T$ be a canonical loss tree.

(i) The loss model is identifiable, i.e. $P_{T,\alpha} = P_{T,\alpha'}$ implies $\alpha = \alpha'$.

(ii) With probability 1, for sufficiently large $n$, $\hat A$, $\hat\alpha$ are the Maximum Likelihood Estimators of $A$, $\alpha$, i.e.,

$$\hat\alpha = \arg\max_{\alpha} \mathcal{L}(T, \alpha). \qquad (4)$$

As a consequence of the MLE property, $\hat A$ is consistent ($\hat A \to A$ as $n \to \infty$ with probability 1), and asymptotically normal ($\sqrt{n}(\hat A - A)$ converges in distribution to a multivariate Gaussian random variable as $n \to \infty$), and similarly for $\hat\alpha$; see [17].
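The empirical quantities $\hat\gamma(k)$ that feed the estimator are simple counts over the receiver traces. A minimal sketch, assuming traces are stored as lists of 0/1 outcomes per receiver and that the receiver sets $R(k)$ are supplied by the caller (the name `gamma_hat` and the toy data are illustrative only):

```python
def gamma_hat(traces, subtree_receivers):
    """Fraction of probes that reach at least one receiver in R(k), for each node k.

    traces: {receiver: [0/1 outcome per probe]}
    subtree_receivers: {node k: iterable of receivers descended from k}
    """
    n = len(next(iter(traces.values())))
    out = {}
    for k, recv in subtree_receivers.items():
        hits = sum(1 for i in range(n) if any(traces[r][i] for r in recv))
        out[k] = hits / n
    return out


# Toy traces for 10 probes on the hypothetical tree 0 -> 1 -> {2, 3}, 3 -> {4, 5}
traces = {
    2: [1, 1, 1, 1, 1, 1, 0, 0, 1, 0],
    4: [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    5: [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
}
R = {1: [2, 4, 5], 2: [2], 3: [4, 5], 4: [4], 5: [5]}
print(gamma_hat(traces, R))
```

Substituting these values for $\gamma$ in (2) and taking quotients of the solutions gives the estimates $\hat A$ and $\hat\alpha$ described above.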

III.  Deterministic  Reconstruction  of  Loss 
Trees  by  Grouping 

The  use  of  estimates  of  shared  loss  rates  at  multicast  re¬ 
ceivers  has  been  proposed  recently  in  order  to  group  mul¬ 
ticast  receivers  that  share  the  same  set  of  bottlenecks  on 
the  path  from  the  source  [16].  The  approach  was  formu¬ 
lated  for  binary  trees,  with  shared  loss  rates  having  the 
direct  interpretation  of  the  loss  rate  on  the  path  from  the 
root  to  the  (nearest)  ancestor  of  two  receivers.  Since  the 
loss  rate  cannot  decrease  as  the  path  is  extended,  the  pair 
of  receivers  for  which  shared  loss  rate  is  greatest  will  be 
siblings;  if  not  then  one  of  the  receivers  would  have  a  sib¬ 
ling  and  the  shared  loss  rate  on  the  path  to  their  ancestor 
would  be  greater.  This  maximizing  pair  is  identified  as  a 
pair  of  siblings  and  replaced  by  a  composite  node  that  rep¬ 
resents  their  parent.  Iterating  this  procedure  should  then 
reconstruct the binary tree.

In  this  section  and  the  following  section  we  establish 
theoretically  the  correctness  of  this  approach,  and  extend 
it  to  cover  general  trees,  i.e.,  those  with  nodes  whose 
branching  ratio  may  be  greater  than  two.  In  this  sec¬ 
tion  we  describe  how  canonical  loss  trees  are  in  one-to- 
one  correspondence  with  the  probability  distributions  of 
the random variables $(X_k)_{k \in R}$ visible at the receivers.

Fig. 1. $B(U') > B(U)$ where $U' = U \cup \{k\}$. Adjoining the non-sibling node $k$ to $U$ increases the value of $B$; see Prop. 1(iv).


Thus  the  loss  tree  can  be  recovered  from  the  receiver 
probabilities.  This  is  achieved  by  employing  an  analog 
of  the  shared  loss  for  binary  trees.  This  is  a  function 
$B(U)$ of the loss distribution at a set of nodes $U$ that
is minimized when $U$ is a set of siblings, in which case
$B(U) = A(a(U))$, i.e. the complement of the shared
loss rate to the nodes $U$. In the case of binary trees, we can
identify  the  minimizing  set  U  as  siblings  and  substitute  a 
composite  node  that  represents  their  parent.  Iterating  this 
procedure  should  then  reconstruct  the  tree.  The  definition 
and  relevant  properties  of  the  function  B  are  given  in  the 
following  proposition. 

Proposition 1: Let $T = (V, L)$ be a canonical loss tree, and let $U \subset V$ with $\#U > 1$.

(i) The equation $(1 - \gamma(U)/B) = \prod_{k \in U} (1 - \gamma(k)/B)$ has a unique solution $B(U) > \gamma(U)$.

(ii) Let $B > \gamma(U)$. Then $(1 - \gamma(U)/B) > \prod_{k \in U} (1 - \gamma(k)/B)$ iff $B > B(U)$.

(iii) $B(U) = A(a(U))$ if $U$ is a set of siblings, and hence $B(U)$ takes the same value for any sibling set with a given parent.

(iv) Let $U$ be a set of siblings, and suppose $k \in V$ is such that $a(U \cup \{k\}) \succ a(U)$ and $a(U \cup \{k\}) \succ k$. Then $B(U \cup \{k\}) > B(U)$.

Proposition 1(iv) shows that adjoining a non-sibling non-ancestor node to a set of siblings can only increase the value of $B$; see Figure 1. This provides the means to reconstruct the tree $T$ directly from the $\{\gamma(U) : U \subset R\}$. We call the procedure to do this the Deterministic Loss
Tree Classification Algorithm (DLT), specified in Figure 2; it works as follows. At the start of each while loop from line 4, the set $R'$ comprises those nodes available for grouping. We first find the pair $U = \{u_1, u_2\}$ that minimizes $B(U)$ (line 5), then progressively adjoin to it further elements that do not increase the value of $B$ (lines 6 and 7). The members of the largest set obtained this way are identified as siblings; they are removed from the pool of nodes and replaced by their parent, designated by their union $U$ (line 9). Links connecting $U$ to its children (i.e. members) are added to the tree, and the link loss probabilities are determined by taking appropriate quotients of $B$'s (line 11). This process is repeated until all sibling sets have been identified. Finally, we adjoin the root node 0 and the link joining it to its single child (lines 14 and 15).

1.  Input: The set of receivers $R$ and associated probabilities $\{\gamma(U) : U \subset R\}$;
2.  $R' := R$; $V' := R$; $L' := \emptyset$;
3.  foreach $j \in R'$ do $B(j) := \gamma(j)$; enddo
4.  while $|R'| > 1$ do
5.      select $U = \{u_1, u_2\} \subset R'$ that minimizes $B(U)$;
6.      while there exists $u \in R' \setminus U$ such that $B(U \cup \{u\}) = B(U)$ do
7.          $U := U \cup \{u\}$;
8.      enddo
9.      $V' := V' \cup \{U\}$; $R' := (R' \setminus U) \cup \{U\}$;
10.     foreach $j \in U$ do
11.         $L' := L' \cup \{(U, j)\}$; $\alpha'_j := B(j)/B(U)$;
12.     enddo
13. enddo
14. $V' := V' \cup \{0\}$;
15. $L' := L' \cup \{(0, U)\}$;
16. $\alpha'_U := B(U)$; $\alpha'_0 := 1$;
17. Output: loss tree $((V', L'), \alpha')$.

Fig. 2. Deterministic Loss Tree Classification Algorithm (DLT).

1.  Input: a loss tree $(T, \alpha)$;
2.  Parameter: a threshold $\varepsilon \ge 0$;
3.  $V' := \{0\} \cup d_T(0)$; $L' := \{(0, k) : k \in d_T(0)\}$;
4.  $U := d_T(0)$;
5.  while $U \ne \emptyset$ do
6.      select $j \in U$;
7.      $U := (U \setminus \{j\}) \cup d_T(j)$;
8.      if $((1 - \alpha_j) \le \varepsilon) \wedge (j \notin R)$ then
9.          $L' := (L' \cup \{(f_{T'}(j), k) : k \in d_T(j)\}) \setminus \{(f_{T'}(j), j)\}$;
10.         $V' := (V' \setminus \{j\}) \cup d_T(j)$;
11.     else
12.         $L' := L' \cup \{(j, k) : k \in d_T(j)\}$;
13.         $V' := V' \cup d_T(j)$;
14.     endif;
15. enddo
16. Output: $((V', L'), \alpha')$

Fig. 3. Tree Pruning Algorithm TP($\varepsilon$).

Theorem 2: (i) DLT reconstructs any canonical loss tree $(T, \alpha)$ from its receiver set $R$ and the associated probabilities $\{\gamma(U) : U \subset R\}$.

(ii) Canonical loss trees are identifiable, i.e. $P_{T,\alpha} = P_{T',\alpha'}$ implies that $(T, \alpha) = (T', \alpha')$.

Although we have not shown it here, it is possible to establish that any set $R'$ present at line 4 of DLT has the property that $\min_{U \subset R'} B(U)$ is achieved when $U$ is a sibling set. Consequently one could replace steps 5-8 of DLT by simply finding the maximal sibling set, i.e. select a maximal $U \subset R'$ that minimizes $B(U)$. However, this approach would have worse computational properties since it requires inspecting every subset of $R'$.

$B(U)$ is a root of the polynomial of degree $\#U - 1$ from Prop. 1(i). For a binary subset, $B(\{j, k\})$ is written down explicitly:

$$B(\{j, k\}) = \frac{\gamma(j)\gamma(k)}{\gamma(k) + \gamma(j) - \gamma(\{j, k\})}. \qquad (5)$$
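For completeness, the binary formula (5) follows directly from the defining equation in Prop. 1(i). The derivation below is a worked step added for the reader; the numerical values in the last line are illustrative and not taken from the paper.

```latex
\begin{align*}
1 - \frac{\gamma(\{j,k\})}{B}
  &= \Big(1 - \frac{\gamma(j)}{B}\Big)\Big(1 - \frac{\gamma(k)}{B}\Big)
   = 1 - \frac{\gamma(j) + \gamma(k)}{B} + \frac{\gamma(j)\gamma(k)}{B^2} \\
\Longrightarrow\quad
\big(\gamma(j) + \gamma(k) - \gamma(\{j,k\})\big)\, B &= \gamma(j)\gamma(k)
  \qquad \text{(multiplying through by } B^2\text{)} \\
\Longrightarrow\quad
B(\{j,k\}) &= \frac{\gamma(j)\gamma(k)}{\gamma(j) + \gamma(k) - \gamma(\{j,k\})} . \\[4pt]
\text{Example: } \gamma(j) = 0.72,\ \gamma(k) = 0.504,\ \gamma(\{j,k\}) = 0.8208
  &\;\Longrightarrow\; B(\{j,k\}) = \frac{0.36288}{0.4032} = 0.9 .
\end{align*}
```

By inclusion-exclusion, the denominator $\gamma(j) + \gamma(k) - \gamma(\{j,k\})$ is the probability that a probe reaches receivers below both $j$ and $k$.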


Calculation of $B(U)$ requires numerical root finding when $\#U > 5$. However, it is possible to recover $T$ in a two stage procedure that requires the calculation of $B(U)$ only on binary sets $U$. The first stage uses the Deterministic Binary Loss Tree (DBLT) Classification Algorithm. DBLT is identical to DLT except that grouping is performed only over binary sets, thus omitting lines 6-8 in Figure 2. The second stage is to use a Tree Pruning (TP) Algorithm on the output of the DBLT. TP acts on a loss tree $((V, L), \alpha)$ by removing from $L$ each internal link $(f(k), k)$ with loss rate $1 - \alpha_k = 0$ and identifying its endpoints $k, f(k)$. We will find it useful to specify a slightly more general version: for $\varepsilon \ge 0$, TP($\varepsilon$) prunes link $k$ when $1 - \alpha_k \le \varepsilon$. We formally specify TP($\varepsilon$) in Figure 3. In Section X we prove that composing the binary algorithm DBLT with pruning recovers the same topology as DLT for general canonical loss trees:

Theorem 3: DLT = TP(0) $\circ$ DBLT for canonical loss trees.
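The pruning stage can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name `tree_prune` and the dictionary representation are assumptions, the pruning test follows the condition in Figure 3 (loss at most $\varepsilon$ and node not a receiver), and the choice to multiply the pass probability of a pruned link into its children's links is an assumption made here, since the text does not say how $\alpha'$ is assigned.

```python
def tree_prune(children, alpha, receivers, eps=0.0, root=0):
    """Prune links with loss 1 - alpha[j] <= eps whose lower endpoint is not a receiver.

    Returns (new_children, new_alpha). Children of a pruned node are attached
    to its nearest surviving ancestor.
    """
    new_children, new_alpha = {}, {}
    # Each stack entry: (node, surviving parent, pass probability carried over pruned links)
    stack = [(j, root, 1.0) for j in children.get(root, [])]
    while stack:
        j, par, carried = stack.pop()
        if (1.0 - alpha[j]) <= eps and j not in receivers:
            # Prune link j: identify j with its parent and re-attach j's children.
            for k in children.get(j, []):
                stack.append((k, par, carried * alpha[j]))
        else:
            new_children.setdefault(par, []).append(j)
            new_alpha[j] = carried * alpha[j]  # assumption: fold pruned links into the child link
            for k in children.get(j, []):
                stack.append((k, j, 1.0))
    return new_children, new_alpha


# Hypothetical binary tree in which node 5 sits below a zero-loss link:
# 0 -> 1 -> {2, 5}, 5 -> {3, 4}. Pruning with eps = 0 makes 2, 3, 4 siblings under 1.
children = {0: [1], 1: [2, 5], 5: [3, 4]}
alpha = {1: 0.9, 2: 0.8, 5: 1.0, 3: 0.7, 4: 0.6}
print(tree_prune(children, alpha, receivers={2, 3, 4}, eps=0.0))
```

With `eps=0.0` only lossless links are removed, which is the case used in Theorem 3; a positive `eps` gives the thresholded variant.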


IV. Inference of Loss Tree from Measured
Leaf Probabilities

In this section, we present algorithms which adapt DLT to use the measured probabilities corresponding to the $\gamma$. Let $(x_k^{(i)})_{k \in R}^{i = 1, \ldots, n}$ denote the measured outcomes arising from each of $n$ probes. Define the processes $Y_k^{(i)}$ recursively by

$$Y_k^{(i)} = \bigvee_{j \in d(k)} Y_j^{(i)}, \quad \text{with } Y_k^{(i)} = x_k^{(i)},\ k \in R. \qquad (6)$$

Thus $Y_k^{(i)} = 1$ iff probe $i$ was received at some receiver descended from $k$; $\hat\gamma(k) = n^{-1} \sum_{i=1}^{n} Y_k^{(i)}$ is the fraction of the probes $1, \ldots, n$ that reach some receiver descended from $k$. For $U \subset V$ we define $\hat\gamma(U) = n^{-1} \sum_{i=1}^{n} \bigvee_{j \in U} Y_j^{(i)}$ analogously; $\hat\gamma(U)$ is the fraction of probes that reach some receiver descended from some node in $U$. Let $\hat B(U)$ be the unique solution in Prop. 1(i) obtained by using $\hat\gamma$ in place of $\gamma$. We will use the notation $(\hat T, \hat\alpha)$ to denote an inferred loss tree; sometimes we will use $\hat T_X$ to distinguish the topology inferred by a particular algorithm X. $P_X^f$ will denote the probability of false identification of topology $T$ of the loss tree $(T, \alpha)$, i.e. $P_X^f = P_{T,\alpha}[\hat T_X \ne T]$.

1.  Input: The set of receivers $R$, number of probes $n$, receiver traces $(x_k^{(i)})_{k \in R}^{i = 1, \ldots, n}$;
2.  $R' := R$; $V' := R$; $L' := \emptyset$;
3.  foreach $k \in R$ do
4.      $\hat B(k) := n^{-1} \sum_{i=1}^{n} x_k^{(i)}$;
5.      foreach $i = 1, \ldots, n$ do $Y_k^{(i)} := x_k^{(i)}$; enddo;
6.  enddo
7.  while $|R'| > 1$ do
8.      select $U = \{u_1, u_2\} \subset R'$ that minimizes
        $\hat B(U) = \dfrac{\sum_{i=1}^{n} Y_{u_1}^{(i)} \sum_{i=1}^{n} Y_{u_2}^{(i)}}{n \sum_{i=1}^{n} Y_{u_1}^{(i)} Y_{u_2}^{(i)}}$;
9.      foreach $i = 1, \ldots, n$ do $Y_U^{(i)} := \bigvee_{u \in U} Y_u^{(i)}$; enddo
10.     $V' := V' \cup \{U\}$; $R' := (R' \setminus U) \cup \{U\}$;
11.     foreach $u \in U$ do
12.         $L' := L' \cup \{(U, u)\}$; $\hat\alpha_u := \hat B(u)/\hat B(U)$;
13.     enddo
14. enddo
15. $V' := V' \cup \{0\}$; $L' := L' \cup \{(0, U)\}$;
16. $\hat\alpha_U := \hat B(U)$; $\hat\alpha_0 := 1$;
17. Output: loss tree $((V', L'), \hat\alpha)$.

Fig. 4. Binary Loss Tree Classification Algorithm (BLT).
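The binary grouping procedure of Figure 4 is short enough to sketch directly. The version below is a minimal illustration under assumptions made here: composite nodes are labelled by the frozenset of receivers they cover, the root is labelled by the string `"root"`, and pairs with no jointly received probe are given an infinite $\hat B$ so that they are never selected first.

```python
from itertools import combinations

ROOT = "root"  # hypothetical label for the probe source


def blt(traces):
    """Binary Loss Tree classification from receiver traces.

    traces: {receiver: [0/1 outcome per probe]}
    Returns (parent, alpha_hat), with nodes labelled by frozensets of receivers.
    """
    n = len(next(iter(traces.values())))
    Y = {frozenset([k]): list(v) for k, v in traces.items()}  # per-node indicator series
    B = {u: sum(y) / n for u, y in Y.items()}                 # B_hat(k) = gamma_hat(k) at leaves
    active = set(Y)
    parent, alpha_hat = {}, {}

    def b_hat(pair):
        u, v = pair
        both = sum(a & b for a, b in zip(Y[u], Y[v]))
        if both == 0:
            return float("inf")
        return sum(Y[u]) * sum(Y[v]) / (n * both)

    while len(active) > 1:
        # Line 8: pick the pair with minimal B_hat (sorted for deterministic tie-breaking).
        u, v = min(combinations(sorted(active, key=sorted), 2), key=b_hat)
        U = u | v
        B[U] = b_hat((u, v))
        Y[U] = [a | b for a, b in zip(Y[u], Y[v])]            # line 9
        for c in (u, v):                                      # lines 11-13
            parent[c] = U
            alpha_hat[c] = B[c] / B[U]
        active -= {u, v}
        active.add(U)

    top = next(iter(active))                                  # lines 15-16
    parent[top] = ROOT
    alpha_hat[top] = B[top]
    return parent, alpha_hat


# Toy traces for 10 probes; receivers 4 and 5 share the most loss.
traces = {
    2: [1, 1, 1, 1, 1, 1, 0, 0, 1, 0],
    4: [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    5: [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
}
parent, alpha_hat = blt(traces)
```

On these toy traces the pair {4, 5} is grouped first and receiver 2 joins at the next step. Each iteration evaluates $\hat B$ for every active pair at a cost linear in $n$, so this sketch favours clarity over speed.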

Theorem 4: Let $((V, L), \alpha)$ be a loss tree. Then $\lim_{n \to \infty} \hat B(U) = B(U)$ for each $U \subset V$.

A. Classification of Binary Loss Trees

The adaptation of DLT is most straightforward for binary trees. By using $\hat B$ in place of $B$ in DLT and restricting the minimization of $\hat B$ to binary sets we obtain the Binary Loss Tree (BLT) Classification Algorithm; we specify it formally in Figure 4. This is, essentially, the algorithm proposed in [16]. We have taken advantage of the recursive structure of the $Y_k^{(i)}$ (in line 9) in order to calculate the probabilities $\hat\gamma$. Note that when BLT reconstructs an incorrect topology $\hat T \ne T$, the definitions of quantities such as $\hat B(U)$ and $Y_U^{(i)}$ extend evidently to subsets $U$ of nodes in the incorrect topology $\hat T$. The following theorem establishes the consistency of the estimator $\hat T_{BLT}$; the proof appears in Section X.

Theorem 5: Let $(T, \alpha)$ be a binary canonical loss tree. With probability 1, $\hat T_{BLT} = T$ for all sufficiently large $n$, and hence $\lim_{n \to \infty} P^f_{BLT} = 0$.

B. Classification of General Loss Trees

The adaptation of DLT to the classification of general loss trees from measured leaf probabilities is somewhat more complicated than the binary case. It is shown during the proof of Theorem 5 that the $\hat B(U)$ have the same relative ordering as the $B(U)$ for $n$ sufficiently large. But for a general tree $(V, L)$, $B(U')$ takes the same value for any subset $U'$ of a maximal sibling set $U \subset V$. For finitely many probes, the corresponding $\{\hat B(U') : U' \subset U\}$ will not in general be equal. Hence choosing to group the subset $U'$ that minimizes $\hat B(\cdot)$ will not necessarily group all the siblings in $U$.

In this section we present three algorithms to classify general trees. Each of these overcomes the problem described in the previous paragraph by incorporating a threshold into the grouping procedure. The set $U$ is grouped if $\hat B(U)$ is sufficiently close to being minimal. However, this can also give rise to false inclusion by effectively ignoring internal links whose loss rates do not exceed the threshold. The variety of algorithms derives from different ways to implement the threshold. We establish domains in which the algorithms correctly classify canonical loss trees. In succeeding sections we evaluate their relative efficiencies and compare their modes and frequencies of false classification.

B.1 Binary Loss Tree Pruning Classification Algorithm BLTP.

Nodes are grouped as if the tree were binary; the resulting tree is pruned with TP($\varepsilon$) to remove all internal links with loss probabilities less than or equal to the threshold $\varepsilon > 0$. Thus for each $\varepsilon > 0$ we define BLTP($\varepsilon$) to be the composition TP($\varepsilon$) $\circ$ BLT. A refinement BLTP'($\varepsilon$) of BLTP($\varepsilon$) is to recalculate the loss probabilities $\hat\alpha'$ based on the measurements and the pruned topology $\hat T'$.


B.2 Binary Loss Tree Clique Classification Algorithm BLTC.

For each $\varepsilon > 0$, BLTC($\varepsilon$) groups by forming maximal sets of nodes $U$ in which all binary subsets $U'$ have $\hat B(U')$ sufficiently close to the true minimum over all binary sets. This amounts to replacing line 8 in Figure 4 with the following steps:

(i) select $U' = \{u', v'\}$ that minimizes $\hat B(U')$;

(ii) construct the graph $G$ of all links $(u'', v'')$ such that $(1 - \varepsilon)\hat B(\{u'', v''\}) < \hat B(U')$;

(iii) select $U$ comprising the elements of the largest connected component of $G$ that contains $U'$.

Note that if the grouping is done correctly, then $B(\{u', v'\})$ takes the same value for all binary subsets $\{u', v'\}$ of $U$. For finite but large $n$, the corresponding sampled $\hat B(\{u', v'\})$ will differ slightly.
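Steps (i) to (iii) can be sketched as a small graph search over the pairwise estimates. This is an illustration only: the function name `bltc_group`, the representation of the pairwise $\hat B$ values as a dictionary keyed by two-element frozensets, and the example numbers are all assumptions made here.

```python
def bltc_group(b_hat, eps):
    """One BLTC grouping step.

    b_hat: {frozenset({u, v}): estimated B for that pair}
    Returns the set of nodes in the connected component containing the minimizing pair.
    """
    # (i) the pair of minimal B_hat
    best = min(b_hat, key=b_hat.get)
    b_min = b_hat[best]

    # (ii) graph of all pairs whose B_hat is within the (1 - eps) tolerance of the minimum
    adj = {}
    for pair, b in b_hat.items():
        if (1.0 - eps) * b < b_min:
            u, v = tuple(pair)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    # (iii) connected component that contains the minimizing pair (depth-first search)
    group, stack = set(), list(best)
    while stack:
        u = stack.pop()
        if u not in group:
            group.add(u)
            stack.extend(adj.get(u, ()))
    return group


# Hypothetical pairwise estimates: a, b, c look like siblings, d does not.
pairs = {
    frozenset("ab"): 0.500, frozenset("ac"): 0.505, frozenset("bc"): 0.510,
    frozenset("ad"): 0.900, frozenset("bd"): 0.900, frozenset("cd"): 0.910,
}
print(sorted(bltc_group(pairs, eps=0.05)))   # a, b and c are grouped together
print(sorted(bltc_group(pairs, eps=0.001)))  # tolerance too tight: only a and b
```

The two calls illustrate the trade-off in choosing the threshold: a tolerance that is too tight splits a true sibling set, while one that is too loose would start absorbing nodes separated by low-loss links.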

B.3  General  Loss  Tree  Classification  Algorithm  GLT. 

For each $\varepsilon > 0$, GLT($\varepsilon$) is a modification of DLT that employs a threshold $\varepsilon$ to perform the grouping based on $\hat{B}$. Each grouping step starts by finding a binary set $\{u_1, u_2\}$ of minimal $\hat{B}$, then adjoining further elements to it provided the resulting set $U$ satisfies $(1-\varepsilon)\hat{B}(U) < \hat{B}(\{u_1, u_2\})$. The violation of this condition has the interpretation that the ancestor $a(U)$ is separated from $a(\{u_1, u_2\})$ by a link with loss rate at least $\varepsilon$. Thus we replace line 8 of Figure 4 by the following.

8a. select $U := \{u_1, u_2\} \subset S$ that minimizes $\hat{B}(\cdot)$;

8b. while there exists $u \in R' \setminus U$ such that $(1-\varepsilon)\hat{B}(U \cup \{u\}) < \hat{B}(\{u_1, u_2\})$ do

8c. $U := U \cup \{u\}$;

8d. enddo

For clarity we have omitted the details of the dependence of $\hat{B}$ on the $\hat\gamma$; these are as described before Theorem 4.
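Lines 8a to 8d can be sketched as follows. This is a hedged illustration: `glt_group` is a hypothetical name, and `b_hat` is assumed to be a callable that returns the estimate for any set of nodes (in the paper this requires the computation described before Theorem 4, which is not reproduced here). The toy `b_hat` simply hard-codes plausible values.

```python
from itertools import combinations

def glt_group(nodes, b_hat, eps):
    """One GLT(eps) grouping step (illustrative sketch, not the paper's code).

    nodes : list of hashable node labels still to be grouped
    b_hat : callable taking a frozenset of nodes, returning the estimated B
    eps   : threshold with 0 < eps < 1
    """
    # 8a: start from the pair with minimal estimated B
    u1, u2 = min(combinations(nodes, 2), key=lambda p: b_hat(frozenset(p)))
    base = b_hat(frozenset((u1, u2)))
    group = {u1, u2}
    # 8b-8d: keep adjoining nodes while the threshold condition holds
    grew = True
    while grew:
        grew = False
        for u in nodes:
            if u not in group and (1 - eps) * b_hat(frozenset(group | {u})) < base:
                group.add(u)
                grew = True
    return group

# Toy estimate: any subset of {a, b, c} shares an ancestor reached with
# probability 0.80; any set that includes d only shares one reached with 0.95.
def toy_b_hat(nodes):
    return 0.80 if nodes <= {"a", "b", "c"} else 0.95

print(glt_group(["a", "b", "c", "d"], toy_b_hat, eps=0.05))
```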

B.4  Convergence  of  General  Loss  Tree  Classifiers. 

As the number of probes grows, the topology estimates furnished by BLTP($\varepsilon$), BLTC($\varepsilon$) and GLT($\varepsilon$) converge to the true topology provided all internal link loss probabilities are greater than $\varepsilon$. This happens for the same reason as it does in BLT. It is not difficult to see that the deterministic versions of each algorithm, obtained by using $B$ in place of $\hat{B}$, reconstruct the topology. Since $\hat{B}$ converges to $B$ as the number of probes grows, the same is true for the classifiers using $\hat{B}$. We collect these results without further proof:

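Theorem 6: Let $(T, \alpha)$ be a loss tree in which all loss probabilities $1 - \alpha_k > \varepsilon'$, $k \in W$, for some $\varepsilon' > 0$. For each $\varepsilon \in (0, \varepsilon')$ and each algorithm $X \in \{$BLTP($\varepsilon$), BLTC($\varepsilon$), GLT($\varepsilon$)$\}$, with probability 1, $\hat{T}_X = T$ for all sufficiently large $n$, and hence $\lim_{n\to\infty} P_{T,\alpha}[\hat{T}_X \neq T] = 0$.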

Convergence to the true topology requires $\varepsilon$ to be smaller than the internal link loss rates, which are typically not known in advance. A very small value of $\varepsilon$ is more likely to satisfy the above condition but at the cost, as shown in Section VIII, of slower classifier convergence. A large value of $\varepsilon$, on the other hand, is more likely to result in systematically removing links with small loss rates. In practice, however, we believe that the choice of $\varepsilon$ does not pose a problem. We expect, indeed, that for many applications, while it is important to correctly identify links with high loss rate, failure to detect those with small loss rates could be considered acceptable. In other words, in practice, it could be sufficient for the inferred topology to converge to $T^{\varepsilon} = \mathrm{TP}(\varepsilon)(T)$, obtained from $T$ by ignoring links whose loss rates fall below some specific value $\varepsilon$ which, in this case, would be regarded as some application-specific minimum loss rate of interest.
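The pruned topology can be illustrated with a short Python sketch. It assumes, as a reading of the description above rather than the paper's formal definition of TP, that pruning removes every internal (non-root, non-leaf) node whose incoming link has loss rate below the threshold and reattaches its children to the nearest surviving ancestor. The names `prune`, `parent` and `loss` are hypothetical.

```python
def prune(parent, loss, eps):
    """Remove internal links with loss rate below eps (illustrative sketch).

    parent : dict child -> parent; the root appears only as a value
    loss   : dict node -> loss rate of the link terminating at that node
    eps    : minimum loss rate of interest
    Returns a new child -> parent dict for the pruned tree.
    """
    has_children = set(parent.values())
    # Only internal nodes are candidates; leaves (receivers) are always kept.
    removed = {k for k in parent if k in has_children and loss[k] < eps}
    pruned = {}
    for k, p in parent.items():
        if k in removed:
            continue
        # Walk up past removed nodes to the nearest surviving ancestor.
        while p in removed:
            p = parent[p]
        pruned[k] = p
    return pruned

# Root 0, internal nodes 1 and 2, receivers 3, 4, 5.
parent = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
loss = {1: 0.05, 2: 0.004, 3: 0.02, 4: 0.001, 5: 0.03}
print(prune(parent, loss, eps=0.01))  # node 2 is removed; 4 and 5 attach to 1
```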

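The results below establish the desired convergence to $T^{\varepsilon}$ for any $\varepsilon \in (0, 1)$ provided $\varepsilon \neq \bar\alpha_k$, $k \in W$. The key observation is that since the deterministic versions of each algorithm reconstruct $T^{\varepsilon}$, so does each algorithm, as the number of probes grows. Denote $P_X^f(\varepsilon) = P_{T,\alpha}[\hat{T}_X \neq T^{\varepsilon}]$. Without further proof we have:

Theorem 7: Let $(T, \alpha)$ be a loss tree. For each $\varepsilon \in (0, 1)$ such that $\varepsilon \neq \bar\alpha_k$, $k \in W$, and for each algorithm $X \in \{$BLTP($\varepsilon$), BLTC($\varepsilon$), GLT($\varepsilon$)$\}$, with probability 1, $\hat{T}_X = T^{\varepsilon} = \mathrm{TP}(\varepsilon)(T)$ for all sufficiently large $n$, and hence $\lim_{n\to\infty} P_X^f(\varepsilon) = 0$.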

C. Effects of Model Violation

The two underlying statistical assumptions are (i) probes are independent; and (ii) conditioned on a probe having reached a given node $k$, the events of probe loss on distinct subtrees descended from $k$ are independent. We now discuss the impact of violations of these assumptions.

The first observation is that the estimators remain consistent under the introduction of some temporal dependence between probes, i.e. under violation of assumption (i) above. Assuming the loss process to be ergodic, $\hat\gamma$ still converges to $\gamma$ almost surely, as the number of probes $n$ grows. However, rates of convergence can be slower, and hence the variance of $\hat{B}$ can be higher, than for the independent case. This would increase the misclassification probabilities for inference from a given number of probes $n$.

On the other hand, spatial dependence of loss (i.e. violations of assumption (ii) above) can lead to bias. We take spatial loss dependence to be characterized by departure from zero of an appropriate set of loss correlation coefficients. By extending an argument given for binary trees in [3, Theorem 7], it can be shown that the limit quantities $B' = \lim_{n\to\infty} \hat{B}$ deform continuously away from the quantities $B$ of the spatially independent case as the loss correlation coefficients move away from zero. Hence a given canonical loss tree can be recovered correctly by applying DBLT to the quantities $B'$ provided the spatial dependence is sufficiently small, i.e., small enough to make the $B'$ sufficiently close to $B$ so that $B(U_1) > B(U_2)$ iff $B'(U_1) > B'(U_2)$ for all relevant subsets of nodes $U_1$ and $U_2$. Then by a similar argument to that of Theorem 5, a tree with link loss rates greater than some $\varepsilon > 0$ is recovered by BLTP($\varepsilon$) with probability 1 for a sufficiently large number $n$ of probes, and sufficiently small spatial correlations.

We remark that the experiments reported in Sections VII and VIII use network level simulation rather than model based simulation. Hence it is expected that the model assumptions will be violated to some extent. Nevertheless, the classifiers are found to be quite accurate.

V.  Maximum-Likelihood  Classifier 

Let $\mathcal{T}(R)$ denote the set of logical multicast trees with receiver set $R$. Denote by $\hat\alpha_T$ the MLE of $\alpha$ in (4) for the topology $T$. The maximum-likelihood (ML) classifier assigns the topology $\hat{T}_{ML}$ that maximizes $\mathcal{L}(T, \hat\alpha_T)$:

$\hat{T}_{ML} = \arg\max_{T \in \mathcal{T}(R)} \mathcal{L}(T, \hat\alpha_T)$. (7)

We prove that, if the link probabilities are bounded away from 0 and 1, the ML-classifier is consistent in the sense that, w.p. 1, it identifies the correct topology as the number of probes grows to infinity. For $\varepsilon > 0$, let $A_T^{\varepsilon} = \{\alpha : \varepsilon \le \alpha_k \le 1 - \varepsilon,\ k \in V \setminus \{0\}\}$.

Theorem 8: Let $\varepsilon > 0$ and let $(T, \alpha)$ be a loss tree with $\alpha \in A_T^{\varepsilon}$. Then $\lim_{n\to\infty} P_{T,\alpha}[\hat{T}_{ML} \neq T] = 0$.

VI.  Loss-Based  Bayesian  Tree  Classifier 

Let $\mathcal{T}(R)$ denote the set of logical multicast topologies having a given receiver set $R$. $A_T^0$, from Section V, is the set of possible loss rates in the topology $T$. A possible loss tree with topology in $\mathcal{T}(R)$ is an element of the parameter space

$\Theta = \bigcup_{T \in \mathcal{T}(R)} (\{T\} \times A_T^0)$. (8)

Let $\pi(T, \alpha)$ be a prior distribution on $\Theta$. Given receiver measurements $x = (x^{(1)}, \ldots, x^{(n)})$, the posterior distribution on $\Theta$ is

$\pi(T, \alpha \mid x) = \pi(T, \alpha) f(x \mid T, \alpha) / f(x)$, (9)

where $f(x \mid T, \alpha)$ is the joint density of the observations and $f(x)$ their marginal density.

A decision rule $\delta$ provides an estimate $\delta(x) \in \Theta$ of the loss tree given receiver measurements $x$. The quality of a decision rule is evaluated in terms of a loss function $H(\theta, \theta')$, a nonnegative function on $\Theta \times \Theta$ interpreted as the loss incurred by deciding that $\theta'$ is the true parameter when, in fact, it is $\theta$. A measure of quality of a decision rule $\delta$ is its Bayes risk $R(\delta) = E[H(\theta, \delta(x))]$, where the expectation is taken with respect to the joint distribution $\pi(T, \alpha) f(x \mid T, \alpha)$ of the loss tree $\theta = (T, \alpha)$ and the observations $x$. The Bayes decision rule $\delta_B$ is the one that minimizes $R(\delta)$: it has least average loss. A standard theorem in decision theory gives $\delta_B$ in the form:

$\delta_B(x) = \arg\min_{\theta' \in \Theta} \int_{\Theta} H(\theta, \theta')\, \pi(\theta \mid x)\, d\theta$, (10)

i.e., it is the minimizer of the posterior risk, which is the expected loss with respect to the posterior distribution $\pi(\theta \mid x)$; see Prop. 3.16 of [17] and Section 4.4, result 1 of [2].

Since our interest is in identifying the correct topology, we choose the loss function $H((T, \alpha), (T', \alpha')) = \chi[T \neq T']$ where $\chi$ is the indicator function, i.e., no loss for a correct identification of the topology, and unit loss for any misidentification. Here, the loss rates $\alpha$ play the role of a nuisance parameter. The Bayes classifier for the topology becomes $\hat{T}_B = T_B(x)$, where

$T_B(x) = \arg\min_{T' \in \mathcal{T}(R)} P[T' \neq T \mid x]$, (11)

or, equivalently,

$T_B(x) = \arg\max_{T' \in \mathcal{T}(R)} P[T' = T \mid x]$. (12)

Thus the Bayes classifier $\hat{T}_B$ yields the topology with maximum posterior probability given the data $x$. By definition, this classifier minimizes the misclassification probability.
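The passage from the posterior risk in (10) to the classifiers (11) and (12) can be spelled out in one line. The following is a sketch that uses only the 0-1 loss and the posterior defined above; writing $P[T \mid x]$ for the posterior probability of topology $T$ (the posterior with the loss rates integrated out) is a notational shorthand introduced here, not taken from the paper.

```latex
\int_{\Theta} H(\theta, \theta')\, \pi(\theta \mid x)\, d\theta
  = \sum_{T \in \mathcal{T}(R)} \chi[T \neq T'] \int \pi(T, \alpha \mid x)\, d\alpha
  = \sum_{T \neq T'} P[T \mid x]
  = 1 - P[T' \mid x].
```

The posterior risk therefore depends on $\theta' = (T', \alpha')$ only through $T'$, and minimizing it over $T'$ is the same as maximizing the posterior probability of $T'$.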

A special case is the uniform prior in which all topologies in $\mathcal{T}(R)$ are taken to be equally likely, and for each topology $T$, $\alpha$ is distributed uniformly on $A_T^0$. The corresponding prior distribution, $\pi(T, \alpha) = \chi_{A_T^0}(\alpha) / \#\mathcal{T}(R)$, is a non-informative prior, expressing "maximum ignorance" about the tree topology and link probabilities. Clearly if other prior information is available about the tree, it may be incorporated into a non-uniform prior distribution. The Bayes classifier becomes

$T_B(x) = \arg\max_{T' \in \mathcal{T}(R)} \int_{A_{T'}^0} f(x \mid T', \alpha)\, d\alpha$. (13)

This should be compared with the ML classifier in (7).

A.  Consistency  of  Pseudo-Bayes  Classifiers. 

In practice our task is to identify the specific topology giving rise to a set of measured data. When no prior distribution is specified, the concept of the Bayes classifier, as the maximizer of the probability of correct classification, does not make sense, because "the" probability of correct classification is not defined. Nonetheless it may be convenient to construct a pseudo-Bayes classifier by choosing a distribution $\pi$ on $\Theta$, which plays the role of a prior, and forming the classifier in (10), which we now denote by $\hat{T}_\pi$. Classifiers constructed in this way are also consistent under a mild condition.

Theorem 9: Let $\pi$ be a prior distribution on $\Theta$, and assume that $(T, \alpha)$ lies in the support of $\pi$. Then $\hat{T}_\pi$ is consistent in the frequentist sense, i.e., $P_{T,\alpha}[\hat{T}_\pi \neq T] \to 0$ as $n \to \infty$.

Fig. 5. Network-level simulation topology for ns. Links are of two types: edge links of 1Mb/s capacity and 10ms latency, and interior links of 5Mb/s capacity and 50ms latency.

VII.  Simulation  Evaluation  and  Algorithm 
Comparison 

A.  Methodology. 

We used two types of simulation to verify the accuracy of the classification algorithms and to compare their performance. In model-based simulation, packet loss occurs pseudorandomly in accordance with the independence assumptions of the model. This allows us to verify the prediction of the model in a controlled environment, and to rapidly investigate the performance of the classifiers in a varied set of topologies.
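A model-based simulation of this kind can be sketched as follows. This is an assumption-laden illustration, not the simulator used for the reported experiments: the function name `simulate_probes`, the `parent`/`alpha` dictionaries, and the toy tree are hypothetical. Each link passes a probe independently with probability `alpha[k]`, which is the independence assumption of the model.

```python
import random

def simulate_probes(parent, alpha, receivers, n, seed=0):
    """Simulate n multicast probes on a loss tree (illustrative sketch).

    parent    : dict child -> parent; the root appears only as a value
    alpha     : dict node -> probability that the link into the node passes a probe
    receivers : iterable of leaf nodes whose outcomes are recorded
    Returns a list of dicts receiver -> 0/1, one dict per probe.
    """
    rng = random.Random(seed)

    def depth(k):
        d = 0
        while k in parent:
            k = parent[k]
            d += 1
        return d

    order = sorted(parent, key=depth)  # parents are visited before children
    outcomes = []
    for _ in range(n):
        reached = {}
        for k in order:
            upstream = reached.get(parent[k], True)  # the root always has the probe
            # Independent Bernoulli loss on each link, applied only if the
            # probe reached the parent node.
            reached[k] = upstream and (rng.random() < alpha[k])
        outcomes.append({r: int(reached[r]) for r in receivers})
    return outcomes

# Root 0, one internal node 1, receivers 2 and 3.
parent = {1: 0, 2: 1, 3: 1}
alpha = {1: 0.9, 2: 0.8, 3: 0.95}
traces = simulate_probes(parent, alpha, receivers=[2, 3], n=20000)
print(sum(t[2] for t in traces) / len(traces))  # should be near 0.9 * 0.8
```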

This approach was complemented by network-level simulations using the ns [13] program; these allow protocol-level simulation of probe traffic mixed in with background traffic of TCP and UDP sessions. Losses are due to buffer overflow, rather than being generated by a model, and hence can violate the Bernoulli assumptions underlying the analysis. This enables us to test the robustness to realistic violations of the model. For the ns simulations we used the topology shown in Figure 5. Links in the interior of the tree have higher capacity (5Mb/sec) and latency (50ms) than those at the edge (1Mb/sec and 10ms) in order to capture the heterogeneity between edges and core of a wide area network. Probes were generated from node 0 as a Poisson process with mean interarrival time 16ms. Background traffic comprised a mix of infinite FTP data sources connecting with TCP, and exponential on-off sources using UDP. The amount of background traffic was tuned in order to give link loss rates that could have significant performance impact on applications, down to as low as about 1%. One strength of our methodology is its ability to discern links with such small but potentially significant loss rates. In view of this, we will find it convenient to quote all loss rates as percentages.

B. Performance of Algorithms Based on Grouping
B.1 Dependence of Accuracy on Threshold $\varepsilon$.

We conducted 100 ns simulations of the three algorithms BLTP, BLTC and GLT. Link loss rates ranged from 1.8% to 10.9% on interior links; these are the links that must be resolved if the tree is to be correctly classified. In Figures 6-11 we plot the fraction of experiments in which the topology was correctly identified as a function of the number of probes, for the three algorithms, and for selected values of $\varepsilon$ between 0.25% and 5%. Accuracy is best for intermediate $\varepsilon$, decreasing for larger and smaller $\varepsilon$. The explanation for this behavior is that smaller values of $\varepsilon$ lead to stricter criteria for grouping nodes. With finitely many samples, for small $\varepsilon$, sufficiently large fluctuations of the $\hat{B}$ cause erroneous exclusion of nodes. By increasing $\varepsilon$, the threshold for group formation is increased and so accuracy is initially increased. However, as $\varepsilon$ approaches the smallest interior link loss rate, large fluctuations of the $\hat{B}$ now cause erroneous inclusion of nodes into groups. When $\varepsilon$ is moved much beyond the smallest interior loss rate, the probability of correct classification falls to zero. The behavior is different if we ignore failures to detect links with loss rates smaller than $\varepsilon$. For $\varepsilon = 5\%$ and $\varepsilon = 7\%$, in Figures 12 and 13, respectively, we plot the fraction of experiments in which the pruned topology $T^{\varepsilon}$ was correctly identified for the three algorithms. Here the accuracy depends on the relative values of $\varepsilon$ and the internal link loss rates. In these experiments, the actual loss rates were often very close to 5%, so that small fluctuations result in erroneous inclusions/exclusions of nodes, which accounts for the significant fraction of failures for $\varepsilon = 5\%$. In Section VIII-B we shall analyze this behavior and obtain estimates for the probabilities of misclassification in the regimes described. We comment on the relative accuracy of the algorithms below.

B.2  Dependence  of  Accuracy  on  Topology. 

We performed 1000 model-based simulations using randomly generated 24-node trees with given maximum branching ratios 2 and 4. Link loss rates were chosen at random in the interval [1%,10%]. Figure 14 shows the probability of successful classification for BLTP($\varepsilon$), BLTC($\varepsilon$) and GLT($\varepsilon$) for $\varepsilon = 0.25\%$. In both cases this grows to 1, but convergence is slower for trees with higher branching ratios. We believe this behavior occurs due to the larger number of comparisons of values of $\hat{B}$ that are made for trees with higher branching ratio, each such comparison affording an opportunity for misclassification.

Fig. 8. $\varepsilon = 1.0\%$.

Fig. 9. $\varepsilon = 2.0\%$.

Fig. 10. $\varepsilon = 3.0\%$.

Fig. 11. $\varepsilon = 5.0\%$.

Fig. 12. $\varepsilon = 5\%$.

Fig. 13. $\varepsilon = 7\%$.

B.3  Comparison  of  Grouping  Algorithm  Accuracy. 

In all experiments reported so far, with one exception, the accuracies of BLTP and GLT were similar, and at least as good as that of BLTC. The similar behavior of BLTP and GLT is explained by observing that the two algorithms group nodes in a similar manner. In BLTP, a link is pruned from the reconstructed binary tree if its inferred loss rate is smaller than $\varepsilon$. In GLT, a node is added to a group if the estimated common loss of the augmented group is within $\varepsilon$ of the estimated common loss of the original group. The operation of BLTC is somewhat different, checking all possible pairs amongst candidate nodes for grouping. Incorrect ordering in any test can result in false exclusion from a sibling set. We observe also that the performance gap between BLTC and the other algorithms is sensitive to the value of $\varepsilon$ and to the branching ratio. The exceptional case in which BLTC performs better than the other algorithms is in the inference of binary trees: here BLTC performs slightly better because of the stricter grouping condition it employs, making it less likely to group more than two nodes.

B.4  Computational  Complexity  of  Grouping  Algorithms. 

Of the two best performing grouping algorithms, namely BLTP and GLT, we observe that BLTP has smaller computational complexity for several reasons. First, $\hat{B}$ is given explicitly for binary groups, whereas generally it requires numerical root finding. Second,
although the algorithms have to calculate $\hat{B}$ for up to $O(\#R^2)$ groups, in typical cases GLT requires additional calculations due to the larger sibling groups considered.
Thirdly, observe that each increase of the size of sets considered in GLT is functionally equivalent to one pruning phase in BLTP. Thus in GLT, the threshold $\varepsilon$ is applied throughout the algorithm; in BLTP it is applied only at the end. We expect this to facilitate adaptive selection of $\varepsilon$ in BLTP. Comparing now with BLTC, we observe that this algorithm requires, in addition to the calculation of shared losses, the computation of a maximal connected subgraph, an operation that does not scale well for large numbers of nodes. For these reasons we adopt BLTP as our reference grouping algorithm since it is the simplest and has close to the best accuracy. In the next section, we compare its performance with that of the ML and Bayesian classifiers.

C. Comparison of BLTP with the ML and Bayesian Classifiers

C.1 Complexity.

In this section we compare our reference grouping algorithm, BLTP, with the ML and Bayesian classifiers. Here we consider the simplest implementation of these classifiers whereby we proceed by exhaustive search of the set $\mathcal{T}(R)$ of all possible topologies during evaluation of the maxima (7) and (13). By contrast, all the grouping algorithms proceed by eliminating subsets of $\mathcal{T}(R)$ from consideration; once a set of nodes is grouped, then only topologies which have those nodes as siblings are considered.

The Bayesian classifier further requires numerical integration for each candidate topology. In order to reduce its complexity we took the prior for the link rates to be uniform on the discrete set {1%, ..., 10%}, with all topologies equally likely; we also precomputed the joint distributions $f(x \mid T, \alpha)$. Due to these computational costs, we were able to compare BLTP with the ML classifier for only up to five receivers, and restricted the Bayesian classifier to the smallest non-trivial case, that of three receivers. The four possible three-receiver trees are shown in Figure 15. In this case, the execution time of the Bayesian classifier was one order of magnitude longer than that of the ML classifier, and about two orders of magnitude longer than that of BLTP.

Fig. 15. ML and Bayesian Classifier: The four possible topologies with three receivers.

C.2  Relative  Accuracy. 

We conducted 10,000 simulations with the loss tree $(T, \alpha)$ selected randomly according to the uniform prior. As remarked in Section VI, the Bayesian classifier is, by definition, optimal in this setting. This is seen to be the case in Figure 16, where we plot the fraction of experiments in which the topology was incorrectly identified as a function of the number of probes, for the different classifiers (for clarity we plot separately the curves for the ML and BLTP($\varepsilon$) classifiers). Accuracy of BLTP varies greatly with $\varepsilon$: it gets close to optimal for the intermediate value of $\varepsilon = 0.5\%$, but rapidly decreases otherwise as $\varepsilon$ approaches either 0 or the smallest internal link loss rate. It is interesting to observe that the ML classifier fails 25% of the time. This occurs when $T$ is the non-binary tree at the left in Figure 15. The reason is that the likelihood function is invariant under the insertion of links with zero loss. Statistical fluctuations present with finitely many probes lead the tree with highest likelihood to be a binary tree obtained by insertion of links with near-zero loss. This behavior does not contradict the consistency property of the ML classifier in Theorem 8; if links with loss less than some $\varepsilon > 0$ are excluded from consideration, then for a sufficiently large number of probes, the spurious insertion of links will not occur.

The effect of these insertions can be suppressed by pruning after ML classification. Setting ML($\varepsilon$) = TP($\varepsilon$) $\circ$ ML we find the accuracy almost identical with that of BLTP($\varepsilon$); this is plotted in Figure 16(b). A more detailed inspection of the experiments shows that BLTP selects the maximum likelihood topology most of the time.

In practice we want to classify a fixed but unknown topology. In this context the uniform prior specifies a pseudo-Bayesian classifier, as in Section VI. Note that this classifier is not necessarily optimal for a fixed topology. We conducted a number of experiments of 10,000 simulations of the three algorithms with fixed loss trees. The relative accuracy of the algorithms was found to vary with both topology and link loss rates. However, in all examples we found a value of $\varepsilon$ for which BLTP($\varepsilon$) accuracy either closely approached or exceeded that of the ML and Bayesian classifiers. As an example, in Figure 17 we plot the results for the first binary tree topology in Figure 15 with all loss rates equal to 10% but that of the sole internal link, which has loss rate 1%. In this example, the ML classifier is more accurate than the pseudo-Bayesian classifier. BLTP($\varepsilon$) accuracy improves as $\varepsilon$ is decreased, and eventually, for $\varepsilon = 0.25\%$, it exceeds that of the pseudo-Bayesian and ML classifiers.

Fig. 16. Misclassification in ML, Bayesian and BLT classifier: $(T, \alpha)$ randomly drawn according to the prior distribution. (a) Bayes and BLTP($\varepsilon$) classifier, (b) Bayes and ML classifiers.

Fig. 17. Misclassification in ML, Bayesian and BLT classifier: fixed $(T, \alpha)$. Left: fraction of misclassified topologies. Right: Comparison of experimental and approximated tail slopes.

These experimental results are supported by approximations to the tail slopes of the log misclassification probabilities, as detailed in Section VIII. For the same example, we display in Figure 17 (right) the estimated experimental and numerically approximated tail slopes of the ML and BLTP classifiers. For a given classifier these agree within about 25%. Finally, not reported in the Figure, we also verified that the ML($\varepsilon$) classifiers provide the same accuracy as BLTP($\varepsilon$).

D.  Summary. 

Whereas the Bayesian classifier is optimal in the context of a random topology with known prior distribution, similar accuracy can be achieved using BLTP($\varepsilon$) or ML($\varepsilon$) with an appropriately chosen threshold $\varepsilon$. In fixed topologies, the corresponding pseudo-Bayes classifier is not necessarily optimal. In the fixed topologies for which we were able to make comparisons, better accuracy could be obtained using BLTP($\varepsilon$) or ML($\varepsilon$) with an appropriate threshold $\varepsilon$. The accuracies of BLTP($\varepsilon$) and ML($\varepsilon$) are similar: most of the time BLTP selects the topology with maximum likelihood.

BLTP has the lowest complexity, primarily because each grouping operation excludes subsets of candidate topologies from further consideration. By contrast, the ML and Bayesian classifiers used exhaustive searches through the space of possible topologies. Since the number of possible topologies grows rapidly with the number of receivers, these methods have high complexity. A more sophisticated search strategy could reduce complexity for these classifiers, but we expect this to be effective only if the number of topologies to be searched is reduced (e.g. in the manner of BLTP). With larger numbers of receivers, any fixed reduction in the per-topology computational complexity would eventually be swamped due to the growth in the number of possible topologies.


VIII.  Misgrouping  and  Misclassification 


In this section, we analyze more closely the modes of failure of BLTP, and estimate convergence rates of the probability of correct classification. Since this classifier proceeds by recursively grouping receivers, we can analyze topology misclassification by looking at how sets of receivers can be misgrouped in the estimated topology $\hat{T}$. We formalize the notion of correct receiver grouping as follows. $R_T$ will denote the set of receivers in the logical multicast topology $T$.

Definition 1: Let $(T, \alpha)$ be a loss tree with $T = (V, L)$, and let $(\hat{T}, \hat\alpha)$ be an inferred loss tree with $\hat{T} = (\hat{V}, \hat{L})$. The receivers $R_T(i)$ descended from a node $i \in W$ are said to be correctly grouped in $\hat{T}$ if there exists a node $\hat{\imath} \in \hat{V}$ such that $R_T(i) = R_{\hat{T}}(\hat{\imath})$. In this case we shall say also that node $i$ is correctly classified in $\hat{T}$.

Observe that we allow the trees rooted at $i$ and $\hat{\imath}$ to be different in the above definition; we only require the two sets of receivers to be equal.

Correct receiver grouping and correct topology classification are related; in the case of binary trees, the topology is correctly classified if and only if every node $k \in W$ is correctly classified. This allows us to study topology misclassification by looking at receiver misgrouping. To this end, we need to first introduce a more general form of the function $B(\cdot)$ to take into account expressions which may arise as a result of classification errors. Observe that in (6) for $k \in V$ we defined $\hat{Y}_k^{(i)}$ as $\hat{Y}_k^{(i)} = \bigvee_{j \in d(k)} \hat{Y}_j^{(i)} = \bigvee_{j \in R_T(k)} Y_j^{(i)}$. In line 9 of BLT we have for the newly formed node $U$, $\hat{Y}_U^{(i)} = \bigvee_{u \in U} \hat{Y}_u^{(i)} = \bigvee_{j \in S} Y_j^{(i)}$, for some subset $S$ of $R_T$. By construction $S$ is the set of receivers of the subtree of $\hat{T}$ rooted in $U$ (which has been obtained by recursively grouping the nodes in $S$). It is clear that $S = R_T(k)$ for some node $k \in V$ if the subtree has been correctly reconstructed, but, upon an error, $S$ can otherwise be a generic subset of $R_T$. Therefore, in BLT we need to consider the following more general expression

$\hat{B}(S_1, S_2) := \frac{\sum_{i=1}^{n} \bigvee_{j \in S_1} Y_j^{(i)} \cdot \sum_{i=1}^{n} \bigvee_{j \in S_2} Y_j^{(i)}}{n \sum_{i=1}^{n} (\bigvee_{j \in S_1} Y_j^{(i)}) \cdot (\bigvee_{j \in S_2} Y_j^{(i)})} = \frac{\hat\gamma(S_1)\hat\gamma(S_2)}{\hat\gamma(S_1) + \hat\gamma(S_2) - \hat\gamma(S_1 \cup S_2)}$ (14)

where $S_1$ and $S_2$ are two non-empty disjoint subsets of $R_T$. Analogous to Theorem 4, $\lim_{n\to\infty} \hat{B}(S_1, S_2) = B(S_1, S_2)$, where

$B(S_1, S_2) := \frac{P[\bigvee_{j \in S_1} Y_j = 1]\, P[\bigvee_{j \in S_2} Y_j = 1]}{P[(\bigvee_{j \in S_1} Y_j) \cdot (\bigvee_{j \in S_2} Y_j) = 1]} = \frac{\gamma(S_1)\gamma(S_2)}{\gamma(S_1) + \gamma(S_2) - \gamma(S_1 \cup S_2)}$. (15)

(15) can be regarded as a generalization of (5) where we consider a pair of disjoint sets of receivers instead of a pair of nodes.
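The estimator in (14) is straightforward to compute from receiver traces. The sketch below is illustrative only: the names `gamma_hat` and `b_hat_sets`, and the representation of each probe as a dictionary from receiver to 0/1, are assumptions made for the example. Returning infinity when no probe reached both sets is likewise a choice made here, not something the paper specifies.

```python
def gamma_hat(outcomes, receivers):
    """Fraction of probes received by at least one receiver in `receivers`."""
    hits = sum(1 for y in outcomes if any(y[j] for j in receivers))
    return hits / len(outcomes)

def b_hat_sets(outcomes, s1, s2):
    """Estimate B(S1, S2) as in (14) for disjoint receiver sets s1 and s2.

    outcomes : list of dicts receiver -> 0/1, one dict per probe
    """
    g1 = gamma_hat(outcomes, s1)
    g2 = gamma_hat(outcomes, s2)
    g12 = gamma_hat(outcomes, set(s1) | set(s2))
    both = g1 + g2 - g12  # fraction of probes seen in both sets
    if both == 0:
        return float("inf")  # assumption: no shared reception observed
    return g1 * g2 / both

# Four probes, two receivers: a is reached 3 times, b twice, both twice.
traces = [{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 0}, {"a": 0, "b": 0}]
print(b_hat_sets(traces, {"a"}, {"b"}))  # (3/4 * 2/4) / (2/4)
```

For singleton sets this reduces to the pairwise quantity used by the grouping algorithms, which is the sense in which (15) generalizes (5).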

A.  Misgrouping  and  Misclassification  in  BLT 

We start by studying misgrouping in binary trees under BLT. Consider the event $G_i$ that BLT correctly groups nodes in $R_T(i)$ for some $i \in W$. This happens if grouping operations do not pair any nodes formed by recursive grouping of $R_T(i)$ with any nodes formed similarly from the complement $R_T \setminus R_T(i)$, until no candidate pairs in $R_T(i)$ remain to be grouped.

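Lemma 1: A sufficient condition for correct grouping of $i$ is that

$\hat{D}(S_1, S_2, S_3) := \hat{B}(S_1, S_3) - \hat{B}(S_1, S_2) > 0$ (16)

for all $(S_1, S_2, S_3) \in \mathcal{S}(i) = \{(S_1, S_2, S_3) : S_1, S_2 \subset R_T(i),\ S_3 \subset R_T \setminus R_T(i),\ S_k \neq \emptyset,\ k = 1, 2, 3,\ S_k \cap S_\ell = \emptyset,\ k \neq \ell\}$.

Therefore $G_i \supset \bigcap_{(S_1, S_2, S_3) \in \mathcal{S}(i)} Q(S_1, S_2, S_3)$, where $Q(S_1, S_2, S_3)$ denotes the event that (16) holds. This provides the following upper bound for the probability of misgrouping $i$, denoted by $P_i^f$:

$P_i^f := P[G_i^c] \le \sum_{(S_1, S_2, S_3) \in \mathcal{S}(i)} P[Q^c(S_1, S_2, S_3)]$. (17)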

A.1 Estimation of Misclassification Probabilities.

We now consider the asymptotic behavior of $P_i^f$, first for large $n$, then for small loss probabilities $\bar\alpha = 1 - \alpha$. Let $s(k) := \sum_{i \succeq k} \bar\alpha_i$, $k \in V$, and set $D(\cdot) = E[\hat{D}(\cdot)]$.

Theorem 10: Let $(T, \alpha)$ be a canonical loss tree. For each $i \in W$, $\sqrt{n} \cdot (\hat{D}(S_1, S_2, S_3) - D(S_1, S_2, S_3))$, $(S_1, S_2, S_3) \in \mathcal{S}(i)$, converges in distribution, as the number of probes $n \to \infty$, to a Gaussian random variable with mean 0 and variance $\sigma^2(S_1, S_2, S_3)$, with $D(S_1, S_2, S_3) = B(S_1, S_3) - B(S_1, S_2)$. Moreover, as $\|\bar\alpha\| = \max_{k \in V} \bar\alpha_k \to 0$, then:

(i) $D(S_1, S_2, S_3) = s(a(S_1 \cup S_2)) - s(a(S_1 \cup S_3)) + O(\|\bar\alpha\|^2)$;

(ii) $\sigma^2(S_1, S_2, S_3) = s(a(S_1 \cup S_2)) - s(a(S_1 \cup S_3)) + O(\|\bar\alpha\|^2)$;

(iii) $\min_{(S_1, S_2, S_3) \in \mathcal{S}(i)} \frac{D^2(S_1, S_2, S_3)}{\sigma^2(S_1, S_2, S_3)} = \bar\alpha_i + O(\|\bar\alpha\|^2)$, (18)

where, for small enough $\|\bar\alpha\|$, the minimum is attained for $S_1, S_2, S_3$ such that $a(S_1 \cup S_2) = i$ and $a(S_1 \cup S_3) = f(i)$.

Theorem 10 suggests we approximate $P[Q^c(S_1, S_2, S_3)]$ by $\Psi(-\sqrt{n} \cdot D(S_1, S_2, S_3) / \sigma(S_1, S_2, S_3))$, where $\Psi$ is the cdf of the standard normal distribution. Thus for large $n$ and small $\|\bar\alpha\|$, Theorem 10 and (17) together suggest that we approximate the misgrouping probability

$P_i^f \approx e^{-n \bar\alpha_i / 2}$. (19)

Here we have used the fact that the $P_i^f$ should be
dominated by the summand with the smallest (negative)
exponent according to (18). Thus, asymptotically for many
probes, the probability of correctly identifying a group
of receivers descended from node $i$ is determined by the
loss rate of link $i$ alone, and is larger for lossier links.
Moreover, the stated relations between the minimizing
$(S_1,S_2,S_3)$ in (iii) say that the likely mode of failure is
to mistakenly group a child of $i$ with the sibling of $i$.
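
The Gaussian tail approximation above can be checked numerically. The following is only a sketch under stated assumptions: it takes the leading-order value $D^2/\sigma^2 \approx \bar\alpha_i$ from (18) at face value, and the function names (`normal_cdf`, `misgrouping_tail`, `log_slope`) are hypothetical, not part of the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cdf, written with erfc so the lower tail stays accurate."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def misgrouping_tail(n: int, loss_rate: float) -> float:
    """Gaussian approximation Psi(-sqrt(n * D^2 / sigma^2)).

    Assumption: D^2 / sigma^2 is replaced by its leading-order value,
    the loss rate of the link terminating at the grouped node.
    """
    return normal_cdf(-math.sqrt(n * loss_rate))


def log_slope(n: int, loss_rate: float) -> float:
    """Decay rate -log(P) / n, to be compared with loss_rate / 2."""
    return -math.log(misgrouping_tail(n, loss_rate)) / n
```

For a 1% link the decay rate approaches 0.005 from above as the number of probes grows; the gap is the polynomial prefactor of the Gaussian tail, which the exponential approximation ignores.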

In binary trees, the topology is correctly classified
when all groups are correctly formed. Hence
$P^f_{BLT} \le \sum_{i\in W} P_i^f \approx \max_{i\in W} P_i^f$, and we expect $\log P^f_{BLT}$ to be
an asymptotically linear function of $n$ with negative
slope $\bar\alpha/2$, where

$$\bar\alpha = \min_{i\in W} \bar\alpha_i. \tag{20}$$

Thus,  in  the  regime  considered,  the  most  likely  way 
to  misclassify  a  tree  is  by  incorrectly  grouping  siblings 
whose  parent  node  j  terminates  the  least  lossy  internal 
link,  mistakenly  grouping  the  sibling  of  j  with  one  of  its 
children. 

We remark that the preceding argument can be formalized
using Large Deviation theory [5]. However, calculation
of the decay rate appears computationally infeasible,
although one can recover the leading exponent $\bar\alpha/2$ in
the small $\|\bar\alpha\|$ regime.

A.2  Experimental  Evaluation. 

Although we have derived the slope through a series
of approximations, we find that it describes experimental
misclassification and misgrouping reasonably well.

We performed 10,000 experiments with an eight-leaf perfectly
balanced binary tree. On each experiment, the loss
rates are a random permutation of the elements of the set
{0.5%, 1%, ..., 7%, 7.5%}. In this way, the smallest loss
rate is fixed to 0.5%. In Figure 18 we plot the proportion
of links that had loss rates greater than or equal to a
given threshold $\phi$ and were misclassified. As the number
of probes increases, misclassification is due exclusively
to misgrouping of low loss rate links: in this set of
experiments, no link with loss rate higher than 2% was
misclassified once the number of probes exceeded 700.
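
The loss-rate assignment used in each experiment can be sketched as follows. This is an illustrative assumption about the setup, not the authors' code: links are simply numbered 1 to 15 (an eight-leaf balanced binary tree has 15 links once the link from the root to its single child is counted), and the helper name `draw_link_loss_rates` and the seeding scheme are hypothetical.

```python
import random


def draw_link_loss_rates(num_links: int = 15, step: float = 0.005, seed: int = 0) -> dict:
    """Assign a random permutation of {step, 2*step, ..., num_links*step} to links.

    With the defaults this is the set {0.5%, 1%, ..., 7.5%}, so the smallest
    loss rate is always 0.5% regardless of the permutation drawn.
    """
    rates = [step * k for k in range(1, num_links + 1)]
    rng = random.Random(seed)  # seeded so that one experiment can be replayed
    rng.shuffle(rates)
    # Link numbering 1..num_links is an arbitrary labelling for this sketch.
    return {link: rate for link, rate in enumerate(rates, start=1)}
```

Calling the function with a different `seed` for each of the 10,000 experiments would reproduce the "random permutation per experiment" design described above.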

According to (19), the different curves should be
asymptotically linear with negative slope approximately
$\bar\alpha/2$ (then adjusted by a factor $\log_{10} e$ since the logarithms
are to base 10). In the table in Figure 18 (right) we display
the estimated experimental and approximated slopes.
Agreement is good for $\phi$ = 2.5% and 5%. We believe
the greater error for $\phi$ = 7.5% may be due to the departure
from the leading order linear approximations of
(18) for larger values of $\bar\alpha_k$; also relatively few points are
available for estimation from the experimental curves. In
the figure, we also plot the log fraction of times BLT correctly
identifies the topology; as expected, this curve exhibits
the same asymptotic linear slope as the fraction of
misgrouped links, i.e., the one for $\phi$ = 0%.

B. Misgrouping and Misclassification in BLTP($\epsilon$)

We turn our attention to the errors in classifying general
trees by the reference algorithm BLTP($\epsilon$). In the following,
without loss of generality, we will study the errors
in the classification of the pruned tree $(T^\epsilon,\alpha^\epsilon) = TP(\epsilon)(T,\alpha)$,
with $T^\epsilon = (V^\epsilon, L^\epsilon)$, under the assumption
that $\epsilon \neq \bar\alpha_k$, $k \in W$. This will include, as a special case,
when $\epsilon$ is smaller than the internal link loss rates of the
underlying tree, i.e., when $T^\epsilon = T$, the analysis of the
misclassification of $T$. $W^\epsilon = V^\epsilon \setminus (\{0,1\} \cup R_{T^\epsilon})$ will
denote the set of nodes in $T^\epsilon$ terminating internal links.

Let $(\hat T,\hat\alpha)$ denote the tree produced by BLT; the final
estimate $(\hat T^\epsilon,\hat\alpha^\epsilon)$ is obtained from $\hat T$ by pruning links whose
loss rate is smaller than or equal to $\epsilon$, i.e., $(\hat T^\epsilon,\hat\alpha^\epsilon) = TP(\epsilon)(\hat T,\hat\alpha)$.
In contrast to the binary case, incorrect
grouping by BLT may be sufficient but not necessary for
misclassification. For BLTP($\epsilon$), incorrect classification
occurs if any of the following hold:

(i) at least one node in $W^\epsilon$ is misclassified in $\hat T$;
(ii) TP($\epsilon$) prunes links from $\hat T$ that are present in $T^\epsilon$; or
(iii) TP($\epsilon$) fails to prune links from $\hat T$ that are not present
in $T^\epsilon$.

Observe that (i) implies that a node $i$ such that $\bar\alpha_i < \epsilon$ can
be misclassified and still $\hat T^\epsilon = T^\epsilon$, provided that all the
resulting erroneous links are pruned.

Fig. 18. Misclassification and misgrouping in BLT. Left: fraction of links misclassified with loss > $\phi$, for $\phi$ = 0%, 2.5%, 5.0%, 7.5%.
Right: comparison of experimental and approximated tail slopes.

We have approximated the probability of errors of type
(i) in our analysis of BLT. Errors of type (ii) are excluded
if for all $i \in W^\epsilon$:

$$\hat D(S_1,S_2,S_3,\epsilon) := \hat B(S_1\cup S_2,S_3)(1-\epsilon) - \hat B(S_1,S_2) > 0 \tag{21}$$

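for all $(S_1,S_2,S_3) \in \mathcal{S}(i)$, since this condition implies
that all estimated loss rates of links in the actual tree
are greater than $\epsilon$. Errors of type (iii) are excluded if
$\hat B(S_1,S_3) - \hat B(S_1,S_2) > 0$ and $\hat B(S_1\cup S_2,S_3)(1-\epsilon) - \hat B(S_1,S_2) < 0$,
or if $\hat B(S_1,S_3) - \hat B(S_1,S_2) < 0$
and $\hat B(S_1\cup S_3,S_2)(1-\epsilon) - \hat B(S_1,S_3) < 0$, for all
$(S_1,S_2,S_3) \in \mathcal{S}(\epsilon)$, where $\mathcal{S}(\epsilon) = \{(S_1,S_2,S_3) : S_j \subset R;\ S_j \neq \emptyset;\ S_j \cap S_k = \emptyset,\ j \neq k;\ (S_1\cup S_2\cup S_3) \cap R_T(i) = \emptyset \ \vee\ (S_1\cup S_2\cup S_3) \subset R_T(i) \ \vee\ \exists j \in \{1,2,3\} : R_T(i) \subset S_j,\ i \in W^\epsilon\}$.
The latter conditions
ensure that all the links in the binary tree produced by
BLT, which are either results of node misgrouping or
correspond to fictitious links due to binary reconstruction,
have estimated loss rate less than or equal to $\epsilon$, and
are hence pruned. Summarizing, let $Q(S_1,S_2,S_3,\epsilon)$ be
the event that (21) holds for a given $(S_1,S_2,S_3)$, and
$G(\epsilon)$ the event that the topology is correctly classified.
Then

$$G(\epsilon) \supseteq \bigcap_{k\in W^\epsilon}\big(Q_k \cap Q^{(k)}(\epsilon)\big) \cap Q^{(m)}(\epsilon),$$

where $Q_k = \bigcap_{(S_1,S_2,S_3)\in\mathcal{S}(k)} Q(S_1,S_2,S_3)$,
$Q^{(k)}(\epsilon) = \bigcap_{(S_1,S_2,S_3)\in\mathcal{S}(k)} Q(S_1,S_2,S_3,\epsilon)$
and $Q^{(m)}(\epsilon) = \bigcap_{(S_1,S_2,S_3)\in\mathcal{S}(\epsilon)} \big((Q(S_1,S_2,S_3) \cap Q(S_1,S_2,S_3,\epsilon)^c) \cup (Q(S_1,S_2,S_3)^c \cap Q(S_1,S_3,S_2,\epsilon)^c)\big)$.

Consequently, we can write a union bound for the probability
of misclassification:

$$P^f_{BLTP(\epsilon)} \le \sum_{k\in W^\epsilon}\big(P[Q_k^c] + P[Q^{(k)}(\epsilon)^c]\big) + P[Q^{(m)}(\epsilon)^c] \tag{22}$$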

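and each term in (22) can in turn be bounded above by
a sum similar to the RHS of (17). For the last term, in
particular, observe that

$$Q^{(m)}(\epsilon)^c = \bigcup_{(S_1,S_2,S_3)\in\mathcal{S}(\epsilon)} \big(Q(S_1,S_2,S_3)^c \cup Q(S_1,S_2,S_3,\epsilon)\big) \cap \big(Q(S_1,S_2,S_3) \cup Q(S_1,S_3,S_2,\epsilon)\big)$$
$$\subseteq \bigcup_{(S_1,S_2,S_3)\in\mathcal{S}(\epsilon)} \big(Q(S_1,S_2,S_3,\epsilon) \cup Q(S_1,S_3,S_2,\epsilon)\big) = \bigcup_{(S_1,S_2,S_3)\in\mathcal{S}(\epsilon)} Q(S_1,S_2,S_3,\epsilon), \tag{23}$$

so that $P[Q^{(m)}(\epsilon)^c] \le \sum_{(S_1,S_2,S_3)\in\mathcal{S}(\epsilon)} P[Q(S_1,S_2,S_3,\epsilon)]$.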

B.1 Misclassification Probabilities and Experiment Duration.

We examine the asymptotics of the misclassification
probability $P^f_{BLTP(\epsilon)}$ for large $n$ and small $\|\bar\alpha\|$, by the
same means as in Section VIII-A. This amounts to finding
the mean $D(S_1,S_2,S_3,\epsilon)$ and asymptotic variance
$\sigma^2(S_1,S_2,S_3,\epsilon)$ of the distribution of $\hat D(S_1,S_2,S_3,\epsilon)$,
then finding the dominant exponent $D^2/\sigma^2$ over the various
$(S_1,S_2,S_3)$. Let $\bar\alpha^{>}(\epsilon) = \min_{i\in W^\epsilon}\bar\alpha_i$ denote the
smallest internal link loss rate of $T$ larger than $\epsilon$ and
$\bar\alpha^{<}(\epsilon) = \max_{i\in W\setminus W^\epsilon}\bar\alpha_i$ the largest internal link loss rate
of $T$ smaller than $\epsilon$, or $\bar\alpha^{<}(\epsilon) = 0$ if no such loss rate exists
(which occurs when $\epsilon$ is smaller than all internal link
loss rates). The proof of the following result is similar to
that of Theorem 10 and is omitted.

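Theorem 11: Let $(T,\alpha)$ be a canonical loss tree. For
each $0 < \epsilon < 1$ and $(S_1,S_2,S_3) \in \bigcup_{i\in W^\epsilon}\mathcal{S}(i) \cup \mathcal{S}(\epsilon)$,
$\sqrt{n}\cdot(\hat D(S_1,S_2,S_3,\epsilon) - D(S_1,S_2,S_3,\epsilon))$ converges
in distribution, as the number of probes $n \to \infty$,
to a Gaussian random variable with mean 0 and
variance $\sigma^2(S_1,S_2,S_3,\epsilon)$. Furthermore, as $\|\bar\alpha\| = \max_{k\in V}\bar\alpha_k \to 0$
and $\epsilon/\|\bar\alpha\| \to c \in (0,\infty)$,

(i) $D(S_1,S_2,S_3,\epsilon) = s(a(S_1\cup S_2)) - s(a(S_1\cup S_3)) - \epsilon + O(\|\bar\alpha\|^2)$;

(ii) $\sigma^2(S_1,S_2,S_3,\epsilon) = |s(a(S_1\cup S_2)) - s(a(S_1\cup S_3))| + O(\|\bar\alpha\|^2)$;

(iii) If $(S_1,S_2,S_3) \in \mathcal{S}(i)$, $i \in W^\epsilon$,

$$\min_{(S_1,S_2,S_3)\in\mathcal{S}(i)} \frac{D^2(S_1,S_2,S_3,\epsilon)}{\sigma^2(S_1,S_2,S_3,\epsilon)} = \frac{(\bar\alpha_i-\epsilon)^2}{\bar\alpha_i} + O(\|\bar\alpha\|^2) \tag{24}$$

and

$$\min_{i\in W^\epsilon}\ \min_{(S_1,S_2,S_3)\in\mathcal{S}(i)} \frac{D^2(S_1,S_2,S_3,\epsilon)}{\sigma^2(S_1,S_2,S_3,\epsilon)} = \frac{(\bar\alpha^{>}(\epsilon)-\epsilon)^2}{\bar\alpha^{>}(\epsilon)} + O(\|\bar\alpha\|^2). \tag{25}$$

If $(S_1,S_2,S_3) \in \mathcal{S}(\epsilon)$,

$$\min_{(S_1,S_2,S_3)\in\mathcal{S}(\epsilon)} \frac{D^2(S_1,S_2,S_3,\epsilon)}{\sigma^2(S_1,S_2,S_3,\epsilon)} = \begin{cases} O(\epsilon^2/\|\bar\alpha\|^2) & \text{if } \bar\alpha^{<}(\epsilon) = 0 \\ \dfrac{(\epsilon-\bar\alpha^{<}(\epsilon))^2}{\bar\alpha^{<}(\epsilon)} + O(\|\bar\alpha\|^2) & \text{if } \bar\alpha^{<}(\epsilon) > 0 \end{cases} \tag{26}$$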

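In (26) above, for clarity we distinguish the expressions
for $\bar\alpha^{<}(\epsilon) = 0$ and $\bar\alpha^{<}(\epsilon) > 0$. Observe that the result for
$\bar\alpha^{<}(\epsilon) = 0$ in (26) can actually be obtained by taking the
limit of the expression for $\bar\alpha^{<}(\epsilon) > 0$, which is of the form

$$\big((\epsilon - \bar\alpha^{<}(\epsilon))^2 + O(\|\bar\alpha\|^3)\big) / \big(\bar\alpha^{<}(\epsilon) + O(\|\bar\alpha\|^2)\big).$$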


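Using the same reasoning as was used in Section VIII-A,
we expect the logarithms of the probabilities of
errors of type (i), (ii) and (iii) to be asymptotically linear
in the number of probes $n$, with slopes that behave
respectively as

$$c^{(i)} = \frac{\bar\alpha^{>}(\epsilon)}{2}, \qquad c^{(ii)} = \frac{(\bar\alpha^{>}(\epsilon)-\epsilon)^2}{2\bar\alpha^{>}(\epsilon)}, \qquad c^{(iii)} = \begin{cases} O(\epsilon^2/\|\bar\alpha\|^2) & \text{if } \bar\alpha^{<}(\epsilon) = 0 \\ \dfrac{(\bar\alpha^{<}(\epsilon)-\epsilon)^2}{2\bar\alpha^{<}(\epsilon)} & \text{if } \bar\alpha^{<}(\epsilon) > 0 \end{cases} \tag{27}$$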


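The dominant mode of misclassification is that with the
lowest slope in (27), which then dominates the sum in
(22) for large $n$. Hence we approximate the misclassification
probability to leading exponential order by

$$P^f_{BLTP(\epsilon)} \approx \begin{cases} \exp\Big(-\dfrac{n}{2}\,\dfrac{(\bar\alpha^{>}(\epsilon)-\epsilon)^2}{\bar\alpha^{>}(\epsilon)}\Big) & \text{if } \bar\alpha^{<}(\epsilon) = 0 \\ \exp\Big(-\dfrac{n}{2}\,\min\Big\{\dfrac{(\bar\alpha^{>}(\epsilon)-\epsilon)^2}{\bar\alpha^{>}(\epsilon)},\ \dfrac{(\epsilon-\bar\alpha^{<}(\epsilon))^2}{\bar\alpha^{<}(\epsilon)}\Big\}\Big) & \text{if } \bar\alpha^{<}(\epsilon) > 0 \end{cases} \tag{28}$$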


Since $(\bar\alpha^{>}(\epsilon)-\epsilon)^2/\bar\alpha^{>}(\epsilon) < \bar\alpha^{>}(\epsilon)$, type (ii) errors always dominate type
(i). Between type (ii) and (iii), the prevailing type of errors
depends on the relative magnitude of $\bar\alpha^{>}(\epsilon)$, $\bar\alpha^{<}(\epsilon)$
and $\epsilon$, which satisfy $\bar\alpha^{<}(\epsilon) < \epsilon < \bar\alpha^{>}(\epsilon)$. Type (ii) becomes
prevalent as $\epsilon \to \bar\alpha^{>}(\epsilon)$, since then the type (ii) slope tends to 0;
similarly, type (iii) dominates as $\epsilon \to \bar\alpha^{<}(\epsilon)$. Thus, $\epsilon$ should
be chosen large enough to avoid the type (iii) errors, but
small enough so that the probability of type (ii) does not
become large. Unfortunately, this is not possible unless
information on the actual link loss rates is available. We
believe, nevertheless, that this does not represent a problem
in practice. Indeed, as the analysis above indicates,
for large enough $n$, the most likely way BLTP($\epsilon$) misclassifies
a tree is by either pruning the link with the least loss
rate higher than $\epsilon$ (a type (ii) error) or by not pruning that
with the largest loss rate smaller than $\epsilon$ (a type (iii)
error); either way, the resulting inferred tree would differ
from the actual tree by at most one link, approximately
that with the loss rate closest to $\epsilon$.

Fig. 19. Misclassification and misgrouping in BLTP($\epsilon$):
(a) fraction of misclassified links with loss > $\phi$, for $\phi$ = 0%, 2.5%, 5.0%, 7.5%; (b) fraction of misclassified trees for $\epsilon$ = 0.1%, 0.2%, 0.3%, 0.4%.

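The foregoing arguments allow us to also estimate the
number of probes $N$ required for inference with misclassification
probability $\delta$ in a tree with minimum link loss
rate $\bar\alpha$. This is done by inverting the approximation (28)
to obtain that $N$ is approximately

$$N \approx \begin{cases} -\dfrac{2\bar\alpha^{>}(\epsilon)\log\delta}{(\epsilon-\bar\alpha^{>}(\epsilon))^2} & \text{if } \bar\alpha^{<}(\epsilon) = 0 \\ -2\max\Big\{\dfrac{\bar\alpha^{>}(\epsilon)}{(\epsilon-\bar\alpha^{>}(\epsilon))^2},\ \dfrac{\bar\alpha^{<}(\epsilon)}{(\epsilon-\bar\alpha^{<}(\epsilon))^2}\Big\}\log\delta & \text{if } \bar\alpha^{<}(\epsilon) > 0 \end{cases} \tag{29}$$

Note that for BLT, or when $\epsilon \ll \bar\alpha$, this reduces to the
simple form $N \approx -2\log(\delta)/\bar\alpha$.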
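
The probe-count estimate can be evaluated directly. The sketch below is an illustration only, assuming natural logarithms and the two-case form of (29); the function and parameter names (`probes_required`, `loss_above`, `loss_below`) are hypothetical labels for the smallest internal loss rate above the threshold and the largest one below it.

```python
import math


def probes_required(delta: float, eps: float, loss_above: float, loss_below: float = 0.0) -> float:
    """Approximate number of probes for misclassification probability `delta`.

    loss_above: smallest internal link loss rate larger than the threshold eps.
    loss_below: largest internal link loss rate smaller than eps (0 if none).
    """
    term_above = loss_above / (eps - loss_above) ** 2
    if loss_below == 0.0:
        worst = term_above
    else:
        worst = max(term_above, loss_below / (eps - loss_below) ** 2)
    return -2.0 * worst * math.log(delta)


def probes_required_blt(delta: float, min_loss: float) -> float:
    """Simple form for BLT (or a threshold far below the smallest loss rate)."""
    return -2.0 * math.log(delta) / min_loss
```

With a smallest loss rate of 1%, a target misclassification probability of 1% and a threshold of 0.5%, `probes_required(0.01, 0.005, 0.01)` gives roughly 3700 probes, while `probes_required_blt(0.01, 0.01)` gives roughly 920; setting the threshold to zero in the first function recovers the second.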
We conclude by observing that in the above analysis,
we have implicitly assumed that $W^\epsilon \neq \emptyset$. Nevertheless,
for large enough $\epsilon$, $W^\epsilon = \emptyset$, which corresponds to the
case when $T^\epsilon$ is a degenerate tree where all leaf nodes
are siblings. In this case, it is clear that misclassification
occurs only because of type (iii) errors. The misclassification
analysis for this special case can then be obtained
by taking into account type (iii) errors alone.

B.2  Experimental  Evaluation. 

We performed 10,000 experiments in a 21 node tree
with mixed branching ratio 2 and 3. On each experiment,
the loss rates are a random permutation of the elements
of the set {0.5%, 1%, ..., 9.5%, 10%}, thus having the
same smallest link loss as in the experiments for BLT. In
Figure 19 we plot the fraction of links that had loss rates
greater than or equal to a given threshold $\phi$ and were
misclassified. These appear very similar to those for BLT in
Figure 18. In Figure 19(b) we also plot the fraction of
misclassified trees using BLTP($\epsilon$) for different values of
$\epsilon$, all smaller than the smallest loss rate of 0.5%. With this
choice, $\bar\alpha^{<}(\epsilon) = 0$ and $\bar\alpha^{>}(\epsilon) = 0.5\%$. As expected,
accuracy is best for intermediate $\epsilon$. The difference in shape
between the last and the first three curves indicates the
change between the two different regimes of misclassification.
For $\epsilon$ smaller than 0.4%, misclassification is
dominated by erroneous exclusion of nodes from a group,
while for $\epsilon$ = 0.4%, misclassification is mostly determined
by erroneous pruning of the link with the smallest
loss rate (which is 0.5%) because of statistical fluctuation
of its inferred loss rate below $\epsilon$. In the latter case, we can
use (27) to compute the tail slope, obtaining $4.3 \times 10^{-5}$,
in good agreement with the estimated experimental slope,
which is $4.1 \times 10^{-5}$.
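
The predicted tail slope quoted above can be reproduced with a few lines. This is a sketch under the assumption that the relevant slope is the type (ii) (erroneous pruning) exponent, $(\bar\alpha^{>}(\epsilon)-\epsilon)^2/(2\bar\alpha^{>}(\epsilon))$, rescaled by $\log_{10} e$ because the curves are plotted with base-10 logarithms; the function name `pruning_tail_slope` is hypothetical.

```python
import math


def pruning_tail_slope(eps: float, loss_above: float, base10: bool = True) -> float:
    """Magnitude of the asymptotic slope of log P(erroneous pruning) versus n.

    eps: pruning threshold; loss_above: smallest internal loss rate above eps.
    """
    slope = (loss_above - eps) ** 2 / (2.0 * loss_above)
    # Convert from natural-log units to base-10 units when requested.
    return slope * math.log10(math.e) if base10 else slope


slope = pruning_tail_slope(eps=0.004, loss_above=0.005)
print(f"{slope:.2e}")  # threshold 0.4%, smallest loss rate 0.5%
```

For a threshold of 0.4% and a smallest loss rate of 0.5% this evaluates to about 4.3e-05, which matches the approximated slope reported in the text.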

B.3 Asymptotic Misclassification Rates for the ML-Classifier

We sketch how the theory of large deviations [5]
can be used to bound the asymptotic probability of
misclassification by the ML estimator. The expressions
obtained here were used to determine the ML tail
slopes in the table in Figure 17. First, observe that
$P_{T,\alpha}(\hat T_{ML} \neq T) = \sum_{T'\neq T} P_{T,\alpha}(\hat T_{ML} = T')$. For
$T' \neq T$, each term in this sum can be bounded above by
$P_{T,\alpha}\big(\bigcup_{\alpha'\in A_{T'}}\{\sum_{i=1}^n g(x^i;T',\alpha') > 0\}\big)$, where
$g(x;T',\alpha') = \log(p(x;T',\alpha')/p(x;T,\alpha))$ and $p(x;T,\alpha)$ is
the probability of the outcome $x \in \Omega = \{0,1\}^R$ under
the loss tree $(T,\alpha)$. Let $\mu_n = n^{-1}\sum_{i=1}^n \delta_{x^i}$ denote the
empirical distribution of the first $n$ quantities $x^i$ (here
$\delta_x$ is the unit mass at $x$), and for each $T'$ and $\alpha' \in A_{T'}$
let $\Gamma_{T',\alpha'} = \{\nu \in M_1(\Omega) : \sum_{x\in\Omega} g(x;T',\alpha')\nu(x) > 0\}$
(here $M_1(\Omega)$ is the set of probability measures on $\Omega$) and
set $\Gamma_{T'} = \bigcup_{\alpha'\in A_{T'}}\Gamma_{T',\alpha'}$. Since the $g(x^i;T',\alpha')$ are IID
random variables, we can use Sanov's Theorem [5] to
conclude that

$$\limsup_{n\to\infty}\frac{1}{n}\log P_{T,\alpha}(\hat T_{ML} = T') \le \limsup_{n\to\infty}\frac{1}{n}\log P_{T,\alpha}(\mu_n \in \Gamma_{T'}) \le -\inf_{\nu\in\Gamma_{T'}} K(\nu \mid p(\cdot;T,\alpha)). \tag{30}$$

Here, for $\nu,\eta \in M_1(\Omega)$, $K(\nu \mid \eta) = \sum_{x\in\Omega}\nu(x)\log(\nu(x)/\eta(x))$
is the Kullback-Leibler "distance",
or entropy of $\nu$ relative to $\eta$. By further minimizing
the right-hand term of (30) over all $T' \neq T$, we obtain
an asymptotic upper bound for the decay rate of the
misclassification probability as $n$ increases. For each $T'$, the
minimization can be carried out using the Kuhn-Tucker
theorem; we use the form given in [15].
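
The relative entropy that appears in the Sanov bound is straightforward to evaluate on a finite outcome space. The sketch below is illustrative only: distributions are represented as plain dictionaries from outcomes to probabilities, and the function name `kl_divergence` is hypothetical. It does not perform the constrained minimization over $\Gamma_{T'}$, which the text carries out with the Kuhn-Tucker theorem.

```python
import math


def kl_divergence(nu: dict, eta: dict) -> float:
    """Entropy of `nu` relative to `eta` on a finite outcome space.

    Both arguments map outcomes to probabilities. Terms with nu(x) = 0
    contribute nothing; if nu puts mass where eta has none, the result
    is infinite.
    """
    total = 0.0
    for outcome, p in nu.items():
        if p == 0.0:
            continue
        q = eta.get(outcome, 0.0)
        if q == 0.0:
            return math.inf
        total += p * math.log(p / q)
    return total
```

For example, comparing a fair coin against one with probabilities 0.9 and 0.1 gives a relative entropy of about 0.51 nats, and the relative entropy of any distribution with respect to itself is zero.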

We mention that a lower bound of the following form
can be found:

$$\liminf_{n\to\infty}\frac{1}{n}\log P_{T,\alpha}(\hat T_{ML} \neq T) \ge -\inf\{K(T',\alpha' \mid T,\alpha) : T' \neq T,\ \alpha' \in A_{T'}\} \tag{31}$$

IX. Summary and Conclusions

In this paper we have proposed and established the
consistency of a number of algorithms for inferring logical
multicast topology from end-to-end multicast loss
measurements. The algorithms fall in two broad classes: the
grouping algorithms (BLTP, BLTC and GLT), and the
global algorithms (ML and Bayesian).

The computational cost of the grouping approaches
is considerably less for two reasons: (i) they work by
progressively excluding subsets of candidate topologies
from consideration while the global algorithms inspect all
topologies; and (ii) their cost per inspection of each potential
sibling set is lower. Of the grouping algorithms, the
BLTP approach of treating the tree as binary then pruning
low loss links is simplest to implement and execute.

Of the algorithms presented, only the Bayesian is able
to identify links with arbitrarily small loss rates. All the
other classifiers require a parameter $\epsilon > 0$ that acts as a
threshold: a link with loss rate below this value will be
ignored and its endpoints identified. The threshold is
required in order that sibling groups not be separated due
to random fluctuations of the inferred loss rates. However,
we do not believe that the necessity of a threshold
presents an obstacle to their use in practice, since it is
the identification of high loss links that is more important
for performance diagnostics. In practice we expect $\epsilon$ to
be chosen according to an application-specific notion of a
minimum relevant loss rate.

By construction, the Bayesian classifier has the greatest
accuracy in the context of classification of topologies
drawn according to a known random distribution. However,
the performance gap narrows when classifying a
fixed unknown topology, and in fact the Bayesian classifier
has slightly worse performance than the others in
this context. We conclude that BLTP offers the best
performance, having the lowest computational cost for near
optimal performance.

This selection of BLTP($\epsilon$) motivates analyzing its error
modes, and their probabilities. Although the analysis is
quite complex, a simple picture emerges in the regime of
small loss rates $\bar\alpha_k$ and many probes $n$: errors are
most likely to occur when grouping the children of the
node $j$ that terminates the link of lowest loss rate.

The leading exponents for the misclassification that
were calculated in Section VIII can be used to derive
rough estimates of the number of probes required in practice.
Consider the problem of classifying a general topology
whose smallest link loss rate is 1%. According to
(29), the number of probes required for a misclassification
probability of 1% (using $\epsilon$ = 0.5%) is about 4000. (In a
binary topology using BLT the number required drops to
about 1000.) Using small (40 byte) probes at a low rate of
a few tens of kbits per second, measurements involving this
many probes could be completed within only a few minutes.
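
The duration estimate follows from simple arithmetic on probe size and sending rate. The sketch below makes that explicit; the default rate of 20 kbit/s is an assumed stand-in for "a few tens of kbits per second", headers and inter-probe gaps are ignored, and the function name `measurement_duration_seconds` is hypothetical.

```python
def measurement_duration_seconds(num_probes: int,
                                 probe_bytes: int = 40,
                                 rate_bits_per_sec: float = 20_000) -> float:
    """Time needed to send `num_probes` probes back to back at a fixed bit rate.

    The 20 kbit/s default is an assumption chosen for illustration only.
    """
    total_bits = num_probes * probe_bytes * 8
    return total_bits / rate_bits_per_sec


print(measurement_duration_seconds(4000))          # 64.0 seconds at 20 kbit/s
print(measurement_duration_seconds(4000, 40, 10_000))  # 128.0 seconds at 10 kbit/s
```

At these assumed rates, 4000 probes of 40 bytes take between one and about two minutes, consistent with the "few minutes" figure in the text.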

We note that the grouping methods extend to a wider
class of estimators by replacing the shared loss estimate
with any function on the nodes (i) that increases on moving
away from the root; and (ii) whose value at a node can
be consistently estimated from measurements at receivers
descended from that node. Examples of such quantities
include the mean and variance of the cumulative delay
from the root to a given node; see [6] and [11].

Finally, a challenging problem is to take the resulting
logical multicast trees and map the constituent nodes
onto physical routers within real networks. This remains
beyond our capability at this time.

X. Proofs of the Theorems

The proof of Proposition 1 depends on the following
Lemma.

Lemma 2: Let $g_i > 0$ for $i = 1,2,\ldots,m$; let $g$ be such
that $\min_i g_i < g < \sum_i g_i$ and set $f(b) := (1 - g/b) - \prod_i (1 - g_i/b)$.
Then the equation $f(b) = 0$ has a unique
solution $b^* > g$. Furthermore, given $b > g$ then $f(b) > 0$
if and only if $b > b^*$.

Proof of Lemma 2: Set $c_i = g_i/g < 1$ so that
$\sum_i c_i > 1$. Let $h_1(x) = (1 - x)$, $h_2(x) = \prod_i (1 - c_i x)$
and $h = h_1 - h_2$, so that $f(b) = h(g/b)$. We look
for zeroes of $h$. For $x \in [0,1]$, $h_1''(x) = 0$ and
$h_2''(x) = h_2(x)\big((\sum_i q_i(x))^2 - \sum_i q_i(x)^2\big) > 0$, where
$q_i(x) = c_i/(1 - c_i x) > 0$. Hence $h$ is strictly concave on $[0,1]$.
Now $h(0) = 0$, $h(1) < 0$ and $h'(0) = -1 + \sum_i c_i > 0$.
So since $h$ is concave and continuous on $[0,1]$ there must
be exactly one solution $x^*$ to $h(x) = 0$ for $x \in (0,1)$ and
hence one solution $b^*$ to $f(b) = 0$ for $b > g$. Furthermore,
given $x \in (0,1)$, $h(x) > 0$ iff $x < x^*$ and hence
given $b > g$, $f(b) > 0$ iff $b > b^*$. ■
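
Because Lemma 2 guarantees a single sign change of $f$ on $b > g$, the root can be located by bisection. The following sketch assumes the stronger condition used in the proof, namely that every $g_i$ is smaller than $g$; the function name `solve_shared_rate` and the bracketing strategy are illustrative choices, not the paper's procedure.

```python
import math


def solve_shared_rate(g: float, parts: list, tol: float = 1e-12) -> float:
    """Return the unique b* > g with (1 - g/b) = prod_i (1 - g_i/b).

    Assumes max(parts) < g < sum(parts), so that f(g) < 0 and f(b) > 0
    for all sufficiently large b.
    """
    if not (max(parts) < g < sum(parts)):
        raise ValueError("need max(parts) < g < sum(parts)")

    def f(b: float) -> float:
        return (1.0 - g / b) - math.prod(1.0 - gi / b for gi in parts)

    lo, hi = g, 2.0 * g
    while f(hi) <= 0.0:          # grow the bracket until f changes sign
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * hi:    # invariant: f(lo) <= 0 < f(hi)
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

As a check, take a parent reached with probability 0.9 whose two children forward probes with probabilities 0.8 and 0.7: then the per-child values are 0.72 and 0.63, the combined value is 0.9 × (1 − 0.2 × 0.3) = 0.846, and `solve_shared_rate(0.846, [0.72, 0.63])` recovers 0.9.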

Proof of Proposition 1: Clearly $\min_{k\in U}\gamma(k) < \gamma(U) < \sum_{k\in U}\gamma(k)$
in a canonical loss tree and hence (i) and (ii)
follow from Lemma 2. (iii) is then a restatement of (2),
established during the proof of Prop. 1 in [3].

(iv) Write $U' = U \cup \{k\}$. We refer to Figure 1, where
we show the logical multicast subtree spanned by $k$, $U$
and their descendants, together with $a(U)$, $a(U')$ and the
root 0. From (i), $B(U')$ is the solution of the equation

$$\Big(1 - \frac{\gamma(U')}{B(U')}\Big) = \Big(1 - \frac{\gamma(k)}{B(U')}\Big)\prod_{j\in U}\Big(1 - \frac{\gamma(j)}{B(U')}\Big) \tag{32}$$

and $B(U)$ is the solution of

$$\Big(1 - \frac{\gamma(U)}{B(U)}\Big) = \prod_{j\in U}\Big(1 - \frac{\gamma(j)}{B(U)}\Big) \tag{33}$$

Now suppose that $B(U) > B(U')$. We shall show
that this leads to a contradiction. Since then $B(U) > B(U') > \gamma(U')$
we can apply (i) and (ii) to (32) to obtain

$$1 - \frac{\gamma(U')}{B(U)} > \Big(1 - \frac{\gamma(k)}{B(U)}\Big)\prod_{j\in U}\Big(1 - \frac{\gamma(j)}{B(U)}\Big) = \Big(1 - \frac{\gamma(k)}{B(U)}\Big)\Big(1 - \frac{\gamma(U)}{B(U)}\Big), \tag{34}$$

with the right-hand equality obtained by substitution of
(33). Applying (2) at the node $a(U')$ we have

$$1 - \frac{\gamma(U')}{A(a(U'))} = \Big(1 - \frac{\gamma(k)}{A(a(U'))}\Big)\Big(1 - \frac{\gamma(U)}{A(a(U'))}\Big). \tag{35}$$

Since the assumption $B(U) > B(U')$ implies that
$B(U) > \gamma(U')$, then comparing (34) with (35) and using
(ii) again we find $A(a(U')) < B(U) = A(a(U))$. This
is a contradiction since $a(U') \succ a(U)$ and $T$ canonical
implies $A(a(U')) > A(a(U))$. ■

While proving that DLT reconstructs the tree correctly,
we find it useful to identify a subset $S$ of $V$ as a stratum
if $\{R(k) : k \in S\}$ is a partition of $R$. If DLT
works correctly, then before each execution of the while
loop at line 4 of Figure 2, the set $R'$ is a stratum and the
set $(V',L')$ of nodes and links is consistent with the actual
tree $(V,L)$ in the sense that it decomposes over subtrees
rooted at the stratum $R'$, i.e., $V' = \bigcup_{k\in R'} V(k)$ and
$L' = \bigcup_{k\in R'} L(k)$. This is because any correct iteration of
the loop that groups the children of node $k$ has the effect
of joining subtrees rooted at nodes in $d(k)$, while modifying
the partition $\{R(k) : k \in R'\}$ by replacing elements
$\{R(j) : j \in d(k)\}$ by $R(k)$. The proof of Theorem 2 depends
on the following Lemma that collects some properties
of strata.

Lemma 3: If $S$ is a stratum in a logical multicast tree
$(V,L)$ then

(i) If $k \in S$ then no ancestor or descendant of $k$ lies in $S$.

(ii) Exactly one of the following alternatives applies to
each non-root node $k$ in $V$: (a) $k \in S$; (b) $k$ has an ancestor
in $S$; (c) $k$ has at least two descendants in $S$.

Proof of Lemma 3: (i) If $j, k \in S$ and $j \prec k$ then $R(j) \subset R(k)$,
contradicting the partition property. (ii) If $k \notin S$,
then there exists $j \in S$ obeying one of the alternatives
$j \succ k$ or $j \prec k$, for otherwise $R(k)$ does not overlap
with any element of the partition $\{R(j) : j \in S\}$. By
(i), the alternatives are exclusive. If there exists $j \in S$ with
$j \succ k$, it is unique, by (i). If not, there exists $j \in S$
with $j \prec k$. In this case $k$ cannot be a leaf node and
hence $R(j) \subsetneq R(k)$ since $k$ has branching ratio at least
2. Hence there must be at least one more node $j' \in S$
with $j' \prec k$, since otherwise the partition $\{R(j) : j \in S\}$
would not cover $R$. ■
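
The definition of a stratum is easy to test mechanically on a small tree. The sketch below is illustrative only: the tree is stored as a hypothetical `children` dictionary mapping each node to its list of children, leaves play the role of receivers, and the helper names `receivers_below` and `is_stratum` are not from the paper.

```python
def receivers_below(children: dict, node) -> set:
    """Receivers (leaves) descended from `node`; a leaf is its own receiver set."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    leaves = set()
    for kid in kids:
        leaves |= receivers_below(children, kid)
    return leaves


def is_stratum(children: dict, root, candidate) -> bool:
    """True if the receiver sets of the nodes in `candidate` partition all receivers."""
    all_receivers = receivers_below(children, root)
    covered = set()
    for node in candidate:
        block = receivers_below(children, node)
        if covered & block:      # two blocks overlap: not a partition
            return False
        covered |= block
    return covered == all_receivers


# Root 0 with single child 1; node 1 has children 2 and 3; receivers are 4..7.
tree = {0: [1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(is_stratum(tree, 0, {2, 6, 7}))  # True
print(is_stratum(tree, 0, {1, 2}))     # False: 2 is a descendant of 1
```

The second call illustrates Lemma 3(i): a set containing both a node and one of its descendants cannot be a stratum, because their receiver sets overlap.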

Proof of Theorem 2: (i) Suppose that DLT yields an incorrect
tree, and consider the first execution of the loop
during which $(V',L')$ becomes inconsistent. Inconsistency
could occur for the following reasons only:

1. If the minimizing pair $\{u_1,u_2\}$ are not siblings. Then
there exists $t \prec a(u_1,u_2)$ that is the parent of either $u_1$
or $u_2$; say $t \succ u_1$. Since $u_1 \in R'$, by Lemma 3(i) no
ancestor of $u_1$ (including $t$) can be in $R'$. Hence by
Lemma 3(ii), there must be at least one node $u_2'$ in addition
to $u_1$ with the property that $u_2' \prec t$ and $u_2' \in R'$.
Since the loss tree is canonical, $B(u_1,u_2') \le A(t) < A(a(u_1,u_2)) = B(u_1,u_2)$,
contradicting the minimality
of $B(u_1,u_2)$. Hence the minimizing pair are siblings.

2. If not all sibling nodes of $u_1,u_2$ are members of $R'$.
Let there be a sibling $s$ of $u_1$ that is not in $R'$. Since
$u_1 \in R'$, then by Lemma 3(i) no ancestor of $u_1$ (and
hence no ancestor of its sibling $s$) can lie in $R'$.
Since $s$ itself is not in $R'$, by Lemma 3(ii), there exist
$s_1,s_2 \in R'$ with ancestor $s$. Since the loss tree is canonical,
$B(s_1,s_2) \le A(s) < A(a(u_1,u_2)) = B(u_1,u_2)$,
contradicting the minimality of $B(u_1,u_2)$. Hence all siblings
of $u_1,u_2$ are members of $R'$.

3. If not all sibling nodes of $u_1,u_2$ are included in $U$ of
steps 5-7. This would violate Prop. 1(iii).

4. If a node that is not a sibling of $u_1,u_2$ is included in
$U$. This would violate Prop. 1(iv).

(ii) Since (i) allows the reconstruction of the loss tree
from the outcome distribution, distinct loss trees cannot
give rise to the same outcome distributions, and hence the
canonical loss tree is identifiable. ■


Proof of Theorem 3: Consider a maximal set $U = \{u_1,u_2,\ldots,u_m\}$
of siblings that is formed by execution
of the while loop at line 6 in DLT; see Figure 2.
We assume the non-trivial case that $m > 2$ and assume
initially that $U$ is unique. By Prop. 1(iii), $B(\cdot)$ is
minimal within $R^{(0)} := R'$ on any pair of nodes from
$S^{(0)} := U$. The action of DBLT can be described iteratively
over $\ell \in \{0,1,\ldots,m\}$ as follows. After selecting
a pair $U^{(\ell)} = \{u_1^{(\ell)},u_2^{(\ell)}\} \subset S^{(\ell)}$ in line 5, all pairs in
$S^{(\ell+1)} = (S^{(\ell)} \setminus U^{(\ell)}) \cup \{U^{(\ell)}\}$ minimize $B(\cdot)$ over all pairs in
$R^{(\ell+1)} = (R^{(\ell)} \setminus U^{(\ell)}) \cup \{U^{(\ell)}\}$, with the same minimum
$B(U)$. This is because $(1 - \gamma(U^{(\ell)})/B(U)) = \prod_{u\in W(\ell)}(1 - \gamma(u)/B(U))$,
where $W(\ell)$ denotes the members
of $U$ that are descended from $U^{(\ell)}$ in the binary
tree built by DBLT. Hence $(1 - \gamma(U^{(\ell)})/B(U)) = (1 - \gamma(u_1^{(\ell)})/B(U))(1 - \gamma(u_2^{(\ell)})/B(U))$
and so $B(U^{(\ell)}) = B(U)$ by Prop. 1(i).

Thus for each step in DLT that groups the nodes in $U$,
there are $m - 1$ steps of DBLT that successively group
the same set of nodes. Since $B(U^{(\ell)}) = B(U)$ for all
$\ell$, each node $j$ added in DBLT has $\alpha_j = 1$, apart from
the last one. Therefore, TP(0) acts to excise all links
between the binary nodes $U^{(1)},\ldots,U^{(m-1)}$. Thus
DLT = TP(0) ∘ DBLT. If $U$ is not unique, the same
arguments apply, except now there can be alternation of
grouping operations acting on different maximal sibling
sets. ■

Proof of Theorem 4: Since each $\hat\gamma(U)$ is the mean of $n$
independent random variables, then by the Strong Law of
Large Numbers, $\hat\gamma(U)$ converges to $E[\hat\gamma(U)] = \gamma(U)$ almost
surely as $n \to \infty$. In Theorem 1 of [3] it is shown
that $B(U)$ is a continuous function of $\{\gamma(a(U)), \{\gamma(k) : k \in U\}\}$,
from which the result follows. ■
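
The consistency argument can be illustrated by simulation on the smallest non-trivial tree: one shared link followed by two receivers. The sketch below is an assumption-laden illustration rather than the paper's estimator code: it uses the closed form for a pair, $\hat B = \hat\gamma_1\hat\gamma_2/(\hat\gamma_1 + \hat\gamma_2 - \hat\gamma_{12})$, which solves $(1-\gamma_{12}/B) = (1-\gamma_1/B)(1-\gamma_2/B)$, and the function name and default pass probabilities are hypothetical.

```python
import random


def estimate_shared_rate(n: int, a_shared: float = 0.9, a_left: float = 0.8,
                         a_right: float = 0.7, seed: int = 0) -> float:
    """Estimate the probability of reaching the branch point from n probes.

    Each probe independently survives the shared link with probability
    a_shared and then each receiver link with probability a_left / a_right.
    """
    rng = random.Random(seed)
    left = right = either = 0
    for _ in range(n):
        reached = rng.random() < a_shared
        x1 = reached and rng.random() < a_left
        x2 = reached and rng.random() < a_right
        left += x1
        right += x2
        either += (x1 or x2)
    g1, g2, g12 = left / n, right / n, either / n
    # g1 + g2 - g12 is the fraction of probes seen by both receivers.
    return g1 * g2 / (g1 + g2 - g12)
```

With 100,000 simulated probes the estimate lies close to the true value 0.9, and the error shrinks further as `n` grows, in line with the almost-sure convergence stated in Theorem 4.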

Proof of Theorem 5: Let $U$ denote a generic binary subset
of $R'$ that minimizes $B(\cdot)$ when DBLT is applied to
$(T,\alpha)$. Assume initially that the minimizing $U$ is unique.
Since the loss tree is canonical, $B(U) < B(U')$ for any
other candidate binary set $U'$; by the convergence property
of Theorem 4, $\hat B(U) < \hat B(U')$ for all $n$ sufficiently
large. Hence the nodes in $U$ are grouped correctly.

But it may happen, by coincidence, that the minimizing
$U$ is not unique. Then there are pairs $U^{(1)},\ldots,U^{(m)}$ that
minimize $B$. Since the tree is canonical, then after each
$U^{(\ell)} \subset R'$ has been grouped, the remaining pairs are still
minimizers of $B(\cdot)$ amongst all pairs of the reduced set
$(R' \setminus U^{(\ell)}) \cup \{U^{(\ell)}\}$ in line 10 of Figure 4. Hence DBLT
picks these pairs successively for grouping until all pairs
have been picked.

With  BLT,  the  are  no  longer  equal.  But  for 

sufficiently  large  n  they  will  still  all  be  less  than  B{U') 
for  any  other  candidate  pair  U' ,  by  Theorem  4.  Thus  BLT 
will  successively  group  the  pairs  . . ,  [/(™)  in  some 
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nodes  descended  from  i,  while  ks  is  some  other  candi¬ 
date  node  not  descended  from  i.  Since  the  grouping  thus 
far  is  correct,  ks  cannot  be  i  or  an  ancestor  of  i,  and 
hence  R{k3)  =  S3  C  Rr  \  Rrii)-  Let  Sj  =  Rr{kj), 
j  =  1,2.  All  the  Sj  are  disjoint.  By  arguments  similar 
to  those  used  in  the  proof  of  Theorem  2,  ^3})  = 

B{Si,S3)  >  B{Si,S2)  =  B{{ki,k2}).  Thus  correct 
grouping  of  ki ,  k2  by  BLT  is  guaranteed  if  (16)  holds  for 
all(5i,52,53)  G<S(i).  ■ 

Proof  of  Theorem  10:  Since  for  each  SCR,  7(5)  is 
I  (T  ,ct  ))  :=  ^T,a{^og{p{X-,T,a)/p{X-,T  ,a  )))he  mean  of  i.i.d.  random  variables  Yg^\  the  variables 

(36)  ^yn■  (7  —  7),  7  =  {7(5)}scfl,  converge  to  a  multivariate 


random  order  that  depends  on  the  relative  magnitude  of 
the  13 But  the  order  is  not  important,  since  the 
end  result  is  just  to  have  the  pairs  formed  as  DBLT  would 
have.  ■ 

Proof  of  Theorem  8:  It  suffices  to  show  that 
lim„_,.oo  Pr,a(7ML  =  T')  =  0  for  each  T'  ^  T,  Let 
p{x\  T,  a)  denote  the  probability  of  the  outcome  x  G 
{0, 1}^  under  the  loss  tree  (T,  a).  Under  our  assump¬ 
tions,  if  T'  ^  T,  the  Kullback-Leibler  information 


is  a  continuous  function  of  a'  G  Af, ,  and  is  strictly  pos¬ 
itive  because  of  identifiability.  Thus  there  is  a  number 
5  >  0  such  that  I  ((T,  a),  (T',  a'))  >  5  for  all  a'  G  Af, . 
Now 


(37) 


1  ^  p{Xii)-V,a') 
n  “  p{X(0-T,a) 


>  0 


Since  a'  G  Af, ,  the  density  p(x;  T',a')  is  bounded  away 
from  zero,  hence  the  conditions  of  Jennrich’s  [8]  uniform 
strong  law  of  large  numbers  are  satisfied.  Thus,  Pr,a- 
almost  surely. 


Gaussian  random  variable  as  n  — >•  00.  Since  £>  is  a  dif¬ 
ferentiable  function  V  of  7,  the  Delta  method  insures  that 
the  stated  convergence  holds. 

To  prove  (i)  observe  that  since  a(5i),  0(53)  -<  a{Si  U 
S3)  then  B{Si,S3)  =  A(a(5i  U  53)).  Since  Si  and  S2 
may  not  satisfy  a(5i),a(52)  -<  a{Si  U  52)-this  may 
occur  whenever  there  was  a  grouping  error  in  any  of 
the  steps  that  lead  to  the  construction  of  node  5i  and/or 
node  52-we  need  to  explicitly  write  the  expression  for 

B{Si,S2), 


B{Si,S2)  = 


=  1]P[\/ jds^^j  =  1] 


1 


-I{{T,a),{r,a'))<-6 


-  p{X(i)-,T,a) 

(38) 

uniformly  in  a'  G  Aft,  whence  the  RHS  of  (37)  con¬ 
verges  to  zero  as  n  — >•  00.  ■ 

Proof of Theorem 9: Recall from the proof of Theorem 8
that the Kullback-Leibler information $I(\theta, \theta')$ is a
continuous function of $\theta'$, and, because of identifiability, has a
unique minimum, namely 0, at $\theta' = \theta$. Given any
neighborhood $U$ of $\theta \in \Theta$, it follows that, for sufficiently small
$\varepsilon > 0$, the set $\{\theta' : I(\theta, \theta') < \varepsilon\}$ is contained in
$U$. Using Theorem 7.80 of Schervish [17], we have, for
$n \to \infty$,

$$\pi(U \mid x) \to 1, \quad P_\theta\text{-a.s.} \qquad (39)$$


■  Vjes^^j  =  1] 

=  P[-’6a(SiUS2)  =  1] 

_ P[VjeSi-^j  =  l|-^a(SiUS2)  =  1] _ 

^['^jeSiXj  =  l|X<j(SiuS2)  =  1)  '^jes^Xj  =  1] 

=  A(a(5i  U  52))V’(Si,S2)  (41) 

where  V’rsi  •=  pKT — -1]  ■ 
T- 1,01, 02;  P[Vjesi  -’tj— l!-’ta(SiuS2)~^’ 

Observe from Proposition 1(iv) that $\psi(S_1, S_2) \le 1$.
Intuitively, the smaller $\psi(S_1, S_2)$, the greater the error
committed so far in classifying the subtree rooted at $i$. (i) then
follows since, as $\|\bar\alpha\| \to 0$, it is easy to verify that
$A(k) = 1 - s(k) + O(\|\bar\alpha\|^2)$ and $\psi(S_1, S_2) = 1 - O(\|\bar\alpha\|^2)$.
To prove (ii), a standard application of the Delta method
shows that the collection of $\sqrt{n}\,(\hat B(S_1, S_2) - B(S_1, S_2))$
converges as $n \to \infty$ to a multivariate Gaussian random
variable with mean zero and covariance matrix


Consider  the  pseudo-Bayes  classifier  T-k,  which  now 
takes  the  form 

T-„(x)  =  argmax7r(r  x  A°\x).  (40) 

reTiR) 

From  (39)  we  obtain  that,  Pr,a  almost  surely,  7r({T}  x 
A!j-\x)  — >  1  and  7r({r}  x  At\x)  — >  0  for  r  ^  T, 
whence  T-k(x)  =  T  for  sufficiently  large  n,  Pr,a  almost 
surely.  ■ 

Proof of Lemma 1: Assume that a number of groupings
have been formed, after which $k_1, k_2$ are candidate


^  dB(Si,S2)  ^  dB(S3,Sf) 
^(SuS2US2,S,)  Cs,s-  . 

(42) 

where  Cs,s'  =  Lov[y’J*\  Now,  following  the 

same  lines  of  Theorem  5  in  [3],  we  have  that  Cs,s'  = 

s(a(S  U  5'))  +  0(11  a  11^),  and  =  5(5, us’).^  + 

0(11  a  11^)  by  direct  differentiation.  Therefore,  we  have 
{83,84}  =  0(5iuS2).(S3US4)  +  0(||a  11^).  Hence, 

(r^{Sl,  82,83)  =  P(Si,S3),(Si,S3)  -1  P(Si,S2).(Sl.S2)  (^3) 

-2P(Si.S2).(Si.S3)  +'^(ll“ll^) 
$= s(a(S_1 \cup S_2)) - s(a(S_1 \cup S_3)) + O(\|\bar\alpha\|^2)$


Finally, (iii) follows as $s(a(S_1 \cup S_2)) - s(a(S_1 \cup S_3))$ is
minimized when $a(S_1 \cup S_2) = i$ and $a(S_1 \cup S_3) = f(i)$.


References 

[1]  A. Adams, T. Bu, R. Caceres, N.G. Duffield, T. Friedman,
J. Horowitz, F. Lo Presti, S.B. Moon, V. Paxson, D. Towsley,
"The Use of End-to-End Multicast Measurements for Characterizing
Internal Network Behavior", IEEE Communications Magazine,
38(5), 152-159, May 2000.

[2]  J. Berger, Statistical Decision Theory and Bayesian Analysis,
2nd ed., Springer, 1985.

[3]  R. Caceres, N.G. Duffield, J. Horowitz and D. Towsley,
"Multicast-Based Inference of Network Internal Loss
Characteristics", IEEE Trans. on Information Theory, 45: 2462-2480,
1999.

[4]  R. Caceres, N.G. Duffield, J. Horowitz, F. Lo Presti and
D. Towsley, "Statistical Inference of Multicast Network
Topology", in Proc. 1999 IEEE Conf. on Decision and Control,
Phoenix, AZ, Dec. 1999.

[5]  A. Dembo and O. Zeitouni, Large Deviation Techniques and
Applications, Jones and Bartlett, Boston-London, 1993.

[6]  N.G. Duffield and F. Lo Presti, "Multicast Inference of Packet
Delay Variance at Interior Network Links", in Proc. IEEE
Infocom 2000, Tel Aviv, March 2000.

[7]  M. Handley, S. Floyd, B. Whetten, R. Kermode, L. Vicisano,
M. Luby, "The Reliable Multicast Design Space for Bulk Data
Transfer", RFC 2887, Internet Engineering Task Force, Aug. 2000.

[8]  R. Jennrich, "Asymptotic properties of nonlinear least-squares
estimators", Ann. Math. Stat. 40: 633-643, 1969.

[9]  B.N. Levine, D. Lavo, and J.J. Garcia-Luna-Aceves, "The
Case for Concurrent Reliable Multicasting Using Shared Ack
Trees", Proc. ACM Multimedia, Boston, MA, November 18-22,
1996.

[10]  B.N. Levine, S. Paul, J.J. Garcia-Luna-Aceves, "Organizing
multicast receivers deterministically according to packet-loss
correlation", Proc. ACM Multimedia 98, Bristol, UK, September
1998.

[11]  F. Lo Presti, N.G. Duffield, J. Horowitz and D. Towsley,
"Multicast-Based Inference of Network-Internal Delay
Distributions", submitted for publication, September 1999.

[12]  mtrace - Print multicast path from a source to a receiver. See
ftp://ftp.parc.xerox.com/pub/net-research/ipmulti

[13]  ns - Network Simulator. See
http://www-mash.cs.berkeley.edu/ns/ns.html

[14]  V. Paxson, J. Mahdavi, A. Adams, M. Mathis, "An Architecture
for Large-Scale Internet Measurement", IEEE Communications
Magazine, Vol. 36, No. 8, pp. 48-54, August 1998.

[15]  M.J.D. Powell, "Gradient conditions and Lagrange multipliers in
nonlinear programming", Lecture 9 in L.C.W. Dixon, E. Spedicato,
G.P. Szego (eds.), "Nonlinear optimization: theory and
algorithms", Birkhauser, 1980, p. 210.

[16]  S. Ratnasamy and S. McCanne, "Inference of Multicast Routing
Tree Topologies and Bottleneck Bandwidths using End-to-end
Measurements", in Proc. IEEE Infocom'99, New York, March
1999.

[17]  M.J. Schervish, "Theory of Statistics", Springer, New York,
1995.

[18]  R.J. Vanderbei and J. Iannone, "An EM approach to OD matrix
estimation", Technical Report, Princeton University, 1994.

[19]  Y. Vardi, "Network Tomography: estimating source-destination
traffic intensities from link data", J. Am. Statist. Assoc., 91: 365-377,
1996.

[20]  B. Whetten, G. Taskale, "Reliable Multicast Transport Protocol
II", IEEE Networks, 14(1), 37-47, Jan./Feb. 2000.




Multicast  Topology  Inference  from  End-to-end  Measurements* 


N.G.  Duffield^’^  J.  Horowitz^  F.  Lo  Presti^’^  D.  Towsley^ 


^  AT&T  Labs-Research 
180  Park  Avenue 
Florham  Park,  NJ  07932,  USA 
{duffield,lopresti}@research.att.com


*Dept.  Math.  &  Statistics 
University  of  Massachusetts 
Amherst,  MA  01003,  USA 
joeh@math.umass.edu 


^Dept.  of  Computer  Science 
University  of  Massachusetts 
Amherst,  MA  01003,  USA 
towsley@gaia.cs.umass.edu


Abstract 

This paper describes a class of statistical estimators
of multicast tree topology based on end-to-end multicast
traffic measurements. This approach allows the
determination of the logical multicast topology without
assistance from the underlying network nodes.
We provide five instances of the class, variously using
loss or delay measurements. We compare their
accuracy and computational cost, and recommend
the best choice in each of the light and heavy traffic
load regimes.

1  Introduction  and  Motivation 

The use of multicast shows great promise for determining
internal network characteristics based solely
on end-to-end measurements. This is because multicast
introduces correlation in the end-to-end behavior
observed by different receivers within the same
multicast session. This correlation can be used to
estimate packet loss rates [1], packet delay
distributions [5], and packet delay variances [4]. These
methods can be used as part of a multicast-capable
measurement infrastructure, such as NIMI (National
Internet Measurement Infrastructure) [9], for the
purpose of monitoring internal network behavior.

All of these multicast-based methods can be applied
to end-to-end multicast observations made of
packets traversing a fixed, but arbitrary, topology.
Knowledge of the multicast topology is required in
order to apply the methods. Unfortunately, knowledge
of the tree topology is not always available.
This motivates the need for algorithms that can
identify the topology of the multicast tree. Another
motivation is that knowledge of the multicast topology
can be of use to multicast applications. For example,
several reliable multicast protocols (e.g., RMTP [8])
rely on logical hierarchies based on the underlying
topology if possible. Other applications attempt to
group receivers that share the same network
bottleneck [10].

In this paper, we present a general framework
within which to develop algorithms for identifying
the multicast topology based on end-to-end
observations. Once the topology has been identified, any of
the methods mentioned above for identifying internal
network behavior can then be applied. The
development of an algorithm is based on the presence
of a packet performance measure that monotonically
increases as the packet traverses down the tree and
that can be estimated based on observations made at
the receivers. We provide additional constraints on
the measures such that the resulting algorithm can be
shown to be asymptotically consistent, i.e., it
identifies the correct topology almost surely as the number
of observations goes to infinity. Examples of
performance measures that yield such algorithms include
loss rate, delay variance, average delay, and link
utilization.

Several algorithms have been proposed for identifying
multicast topologies based on loss observations
made at receivers. For example, [10] presented an
algorithm for identifying a multicast tree when it is a
binary tree. [2] established the correctness of this
algorithm and introduced several other loss-based
algorithms for identifying general trees. They
concluded that an algorithm that constructs a binary tree
and subsequently prunes branches whose loss rates
are close to zero was best suited for topology
identification. The framework presented in this paper comes
from the recognition that this last approach can be
applied to observations of other end-to-end measures
such as delays.

Topology discovery is an essential part of several
current measurement infrastructure projects,
including CAIDA, Felix, IPMA, NIMI and Surveyor;
see [3]. We contrast our approach with that of the
commonly used diagnostic tools traceroute and
mtrace [6] that discover physical topology. These
require cooperation from intervening nodes (in the
generation of ICMP messages, or in maintaining
counters) and their widespread use raises issues of
scaling in topologies with many leaves. The present
methods complement these, being able to discover
the logical multicast topology and its changes without
cooperation from the network, using measurement
traffic whose volume scales well for larger
topologies; see [2] for further discussion.

The paper is organized as follows. We introduce
our framework in Section 2 and provide conditions
under which the resulting algorithms are strongly
consistent. Applications to loss and delay measures
are presented in Section 3. In Section 4 we analyze
the probability of topology misclassification,
asymptotically for large numbers of probes. Section 5
reports on a simulation study on the effectiveness of
different algorithms obtained through this approach
and makes recommendations as to when they
perform well. Section 6 concludes the paper.

2  A Framework for Topology Inference

Tree Model. The physical multicast tree comprises
actual network elements (the nodes), and the
communication links that join them. When a packet is
multicast to a set of receivers, only one copy of the
packet traverses each link of the tree; copies are
created at the branch points of the tree, one per outgoing
link. The logical multicast tree comprises the branch
points of the physical tree, and the logical links
between them. The logical links comprise one or more
physical links. Thus each node in the logical tree,
except the leaf nodes and possibly the root, must have
2 or more children.

Let $T = (V, L)$ denote a logical multicast tree
with nodes $V$ and links $L$. We identify the root node
0 as the source of probes, and $R \subset V$ as the set of
leaf nodes (identified as the set of receivers). The
set of children of node $j \in V$ is denoted by $d(j)$.
Each node $k$, apart from the root, has a parent $f(k)$
such that $(f(k), k) \in L$. Define recursively the
compositions $f^n = f \circ f^{n-1}$ with $f^1 = f$. We will
sometimes refer to the link $(f(k), k)$ as simply link
$k$. Nodes are said to be siblings if they have the same
parent. If $k = f^m(j)$ for some $m \in \mathbb{N}$ we say that $j$
is descended from $k$ (or equivalently that $k$ is an
ancestor of $j$) and write the corresponding partial order in
$V$ as $j \prec k$. $a(i, j)$ will denote the minimal common
ancestor of $i$ and $j$ in the $\prec$-ordering. For $k \in V$
we let $T(k) = (V(k), L(k))$ denote the subtree of $T$
that is rooted at $k$, and set $R(k) = R \cap V(k)$.
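
The tree notation above can be made concrete with a small sketch. It assumes, purely for illustration, that the logical tree is stored as a Python dictionary mapping each non-root node $k$ to its parent $f(k)$; the function names are hypothetical and not part of the paper.

```python
from typing import Dict, Hashable, List, Set

Node = Hashable


def ancestors(parent: Dict[Node, Node], k: Node) -> List[Node]:
    """Return the chain [k, f(k), f^2(k), ..., root]; the root has no parent entry."""
    chain = [k]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def min_common_ancestor(parent: Dict[Node, Node], i: Node, j: Node) -> Node:
    """a(i, j): the first node on the chain from i to the root that also lies above j.

    If one node is an ancestor of the other, this sketch returns that ancestor.
    """
    above_j = set(ancestors(parent, j))
    for node in ancestors(parent, i):
        if node in above_j:
            return node
    raise ValueError("nodes do not belong to the same tree")


def receivers_below(parent: Dict[Node, Node], k: Node) -> Set[Node]:
    """R(k): the leaves of the subtree rooted at k."""
    all_nodes = set(parent) | set(parent.values())
    leaves = all_nodes - set(parent.values())
    return {r for r in leaves if k in ancestors(parent, r)}


# Four-node example: root 0 has the single child 1, whose children 2 and 3 are leaves.
parent = {1: 0, 2: 1, 3: 1}
print(ancestors(parent, 3))               # chain from node 3 up to the root
print(min_common_ancestor(parent, 2, 3))  # a(2, 3)
print(receivers_below(parent, 1))         # R(1)
```
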

Tree Marks. The experience of a multicast packet
on its passage down the tree is modeled by a random
process of marks $x = (x_k)_{k \in V}$. Each mark $x_k$ takes
a value in a set $X$ appropriate to the problem of
interest; $x_k$ specifies the experience of a packet
traversing link $k$. In the setting of packet loss, for example,
we take $X = \{0, 1\}$, where $x_k = 1$ indicates that a
packet present at node $f(k)$ is successfully
transmitted to node $k$, while $x_k = 0$ indicates that it would
be lost.

Composing Marks Along Paths. A path is a set
of contiguous links, identified by the ordered set of
link endpoints $(k_1, \ldots, k_\ell)$ where $k_i = f(k_{i+1})$. We
will sometimes use the notation $p(k)$ to denote the
path $(k_1, \ldots, k_\ell)$ where $k_1 = 0$ and $k_\ell = k$. The
experience of the multicast packet on a path is obtained
by composing the marks from each link on the path
to form a mark for the path. Composition is an
associative and commutative binary operation $\otimes$ on $X$.
A path $p = (k_1, \ldots, k_\ell)$ has mark $x_p$ formed by
successively composing the marks of its constituent
links: $x_p = x_{k_1} \otimes \cdots \otimes x_{k_\ell}$. We assume that $X$
contains an identity $z$ such that $z \otimes x = x$ for all $x \in X$.
In the example of loss we can compose link marks
using the minimum, $x_1 \otimes x_2 = x_1 \wedge x_2$. This
models the physical property that loss occurs on a path
if it occurs on any link of the path. The identity is
$z = 1$.

Measurements and Marks. We assume that the
individual marks $x_k$ are not directly knowable.
Rather, end-to-end measurements comprise the
marks $x_{p(k)}$ along paths terminating at leaf nodes
$k \in R$. Our task will be to infer the underlying
topology from these path marks alone. This
information can also be used to characterize the individual
links, by inferring the distribution of their link marks.

Mark Aggregation. We also equip $X$ with an
aggregation operation that summarizes the experience
of packets over a set of possibly intersecting paths.
We restrict attention to binary trees. Aggregation is
then a binary operation $\oplus$ on $X$. The multicasting
property is reflected in composition $\otimes$ being
distributive over aggregation $\oplus$, i.e.,

$$(x_1 \otimes x_2) \oplus (x_1 \otimes x_3) = x_1 \otimes (x_2 \oplus x_3) \qquad (1)$$

for any $x_1, x_2, x_3 \in X$. The reason that this
relation holds becomes clearer if we consider a four-node
logical multicast tree with root 0 having a single
child 1, the latter node having children 2, 3 that are
leaves. Eq. (1) says that when calculating the
aggregate mark for intersecting paths (1, 2) and (1, 3), we
can factor out the common mark on the common link
1. When dealing with loss, for example, we will take
aggregation as maximization, i.e. $x_1 \oplus x_2 = x_1 \vee x_2$.
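
As an illustrative check of Eq. (1), the sketch below tests the distributive property by brute force over a small set of mark values. The helper name `distributes` is hypothetical. The loss pair (minimum, maximum) is taken from the text above, the delay pair (sum, mean of two delays) is the one used for delay covariance in Section 3.2, and the last pair is a deliberately invalid choice included only for contrast.

```python
from itertools import product


def distributes(compose, aggregate, values):
    """Check Eq. (1): (x1 (x) x2) (+) (x1 (x) x3) == x1 (x) (x2 (+) x3) on all triples."""
    return all(
        aggregate(compose(x1, x2), compose(x1, x3)) == compose(x1, aggregate(x2, x3))
        for x1, x2, x3 in product(values, repeat=3)
    )


# Loss: composition is the minimum, aggregation is the maximum, marks in {0, 1}.
loss_ok = distributes(min, max, [0, 1])

# Delay: composition adds delays, aggregation takes the mean of two delays.
delay_ok = distributes(lambda a, b: a + b, lambda a, b: (a + b) / 2, [0, 1, 2, 5])

# A pair that violates Eq. (1): composition by minimum, aggregation by sum.
bad = distributes(min, lambda a, b: a + b, [0, 1, 2])

print(loss_ok, delay_ok, bad)
```
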

Aggregating Receiver Marks. Nodes $k \in V \setminus (R \cup \{0\})$
have two children, which we denote by
$h(k)$ and $h^*(k)$. For each $k \in V \setminus \{0\}$ we define
the aggregate marks over the paths to all receivers
descended from $k$ recursively through

$$\tilde x_k = \begin{cases} x_{p(k)} & \text{if } k \in R, \\ \tilde x_{h(k)} \oplus \tilde x_{h^*(k)} & \text{otherwise.} \end{cases} \qquad (2)$$

When $\oplus$ is associative (i.e. $(m_1 \oplus m_2) \oplus m_3 = m_1 \oplus (m_2 \oplus m_3)$)
we can write $\tilde x_k = \bigoplus_{j \in R(k)} x_{p(j)}$.
This is the case for loss, where $\tilde x_k$ is 1 if the packet
reaches some receiver descended from $k$ and 0
otherwise. However, we have one example where $\oplus$ is
not associative.

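Mark Distributions. We assume that the marks
are independently distributed according to a mark
distribution $\omega = (\omega_k)_{k \in V}$, where $\omega_k$ is the
distribution of the mark $x_k$. The distribution $\omega_p$ of the
composite mark of a path $p = (k_1, \ldots, k_\ell)$ is determined
by convolution in the usual way: for a measurable
subset $B$ of $X$,

$$\omega_p(B) = (\omega_{k_1} * \cdots * \omega_{k_\ell})(B) := \int \omega_{k_1}(dx_1) \cdots \omega_{k_\ell}(dx_\ell)\, \mathbf{1}\{x_1 \otimes \cdots \otimes x_\ell \in B\}.$$

The joint distribution of $\tilde x_{k_1}, \ldots, \tilde x_{k_\ell}$ will be denoted $\omega_{k_1, \ldots, k_\ell}$.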

Deterministic Reconstruction of Binary Trees.
The classification of trees relies on being able to
identify certain characteristics of paths that do not
terminate at leaves from the characteristics of those
that do. Such a characteristic $\varphi$ will be termed
estimable; the precise definition is below. We
seek estimable characteristics that increase as paths
lengthen; this allows us to select as siblings those
nodes for which the characteristic $\varphi$ on the common
portion of the path from 0 is maximized.

Let $T = (V, L)$ be a binary tree. Let $\Omega$ be a set
of probability distributions on $X$ that is closed
under convolution, and denote by $\Omega^V$ the corresponding
product distributions of marks on $T$. Let $\varphi$ be a
weakly continuous functional on the set of measures on
$X$ that takes values in some linear space $Q$. We equip
$Q$ with the usual partial order, i.e., $(q_i) \ge (q'_i)$ in $Q$
iff $q_i \ge q'_i$ for all $i$. Let $\delta_z$ denote the distribution
which has unit mass at the composition identity $z$.

Definition 1  (i) $\varphi$ is called estimable if there
exists a function $\Phi$ such that, for each $\omega \in \Omega^V$,
and $j, k \in V$ with $a(j,k) \ne j, k$,
$\varphi(\omega_{p(a(j,k))}) = \Phi(\omega_{j,k})$.

(ii) $\varphi$ is called increasing if $\varphi(\delta_z) = 0 \in Q$, and
for all $\omega \in \Omega^V$, $\varphi(\omega_p) < \varphi(\omega_q)$ when $p$ is a
proper subpath of $q$.

The condition $\varphi(\delta_z) = 0$ says that a link whose
marks never change the marks of paths traversing it
does not increase the value of $\varphi$.

Given an estimable, increasing $\varphi$, and a
distribution $\omega$, write $\Phi_{j,k} = \Phi(\omega_{j,k})$. The topology can
be reconstructed from $\varphi$ as follows. The key
observation from (ii) is that $\Phi_{j,k} > \Phi_{j',k'}$ whenever
$a(j,k) \prec a(j',k')$. Thus $\Phi_{j,k}$ is maximized when
$j, k$ are siblings in $R$. For if not, then one of the
receivers, say $j$, would have a sibling $k'$ for which
$\Phi_{j,k'} > \Phi_{j,k}$. Thus the siblings can be identified on
the basis of leaf distributions alone. Substituting a
composite node that represents their parent and
iterating should then reconstruct the binary tree. This
approach is formalized in the Deterministic Binary
Tree Classification Algorithm (DBT); see Figure 1.

1.   Input: the set of receivers $R = \{i_1, \ldots, i_r\}$ and
     the leaf mark distributions;
2.   $R' := R$; $V' := R'$; $L' := \emptyset$;
3.   while $|R'| > 1$ do
4.      select $U = \{j, k\} \subset R'$ with maximal $\Phi_{j,k}$;
5.      $V' := V' \cup \{U\}$;
6.      $L' := L' \cup \{(U, i) : i \in U\}$;
7.      $R' := (R' \setminus U) \cup \{U\}$;
8.   enddo
9.   if $\Phi_{j,k} > 0$ for the last pair $\{j, k\}$ grouped do
10.     $V' := V' \cup \{0\}$; $L' := L' \cup \{(0, R')\}$;
11.  enddo
12.  Output: tree $(V', L')$;

Figure 1: Deterministic Binary Tree Classification
Algorithm (DBT).

DBT operates as follows. $R'$ denotes the current
set of nodes from which a pair of siblings will be
chosen, initially equal to the receiver set $R$. We first
find the pair $U = \{j, k\}$ that maximizes $\Phi_{j,k}$ (line
4). This identifies the members of $U$ as siblings, and
the set $U$ is used to represent their parent.
Correspondingly, we add $U$ to the list $V'$ of nodes (line
5), we add $(U, j)$, $(U, k)$ to the list $L'$ of links (line
6), and replace $j$ and $k$ by $U$ in the set $R'$ of nodes
available for pairing in the next stage (line 7). This
process is repeated until all sibling pairs have been
identified (loop from line 3). Finally, we test in line
9 whether the last node grouped should be taken as
the root node. If the tested value is zero, the last node
grouped is itself the root: were it not the root node,
equality in the test would contradict the increasing
property of $\varphi$. Otherwise, we adjoin a
root node, and a link joining it to its single
descendant (line 10).
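
The grouping loop of DBT can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `dbt` and the oracle `phi_pair` are hypothetical names, each composite node is represented by the frozenset of receivers below it (rather than by the pair $U = \{j, k\}$ itself), and the oracle is assumed to return the value of the characteristic on the shared path of two current nodes.

```python
from itertools import combinations


def dbt(receivers, phi_pair):
    """Sketch of DBT. phi_pair(u1, u2) is an assumed oracle returning Phi for two
    current nodes, each given as a frozenset of the receivers below it."""
    current = [frozenset([r]) for r in receivers]  # R': nodes available for pairing
    nodes = list(current)                          # V'
    links = []                                     # L'
    last_phi = None
    while len(current) > 1:
        # Line 4: select the pair with maximal Phi.
        j, k = max(combinations(current, 2), key=lambda pair: phi_pair(*pair))
        last_phi = phi_pair(j, k)
        u = j | k                                  # composite node standing for the parent
        nodes.append(u)                            # line 5
        links += [(u, j), (u, k)]                  # line 6
        current = [c for c in current if c not in (j, k)] + [u]  # line 7
    # Lines 9-10: a positive value means the last node grouped is not the root.
    if last_phi is not None and last_phi > 0:
        nodes.append("0")
        links.append(("0", current[0]))
    return nodes, links


# Example oracle for a tree 0 -> 1 -> {2, 3}, 2 -> {A, B}, 3 -> {C, D}, with
# made-up cumulative values 0.1 (node 1), 0.3 (node 2) and 0.4 (node 3).
shared = {frozenset("AB"): 0.3, frozenset("CD"): 0.4}


def phi_pair(u1, u2):
    return shared.get(u1 | u2, 0.1)


nodes, links = dbt("ABCD", phi_pair)
print(links)
```

In this example the pair {C, D} is grouped first, then {A, B}, and finally the two composite nodes; because the last value is positive, a root node "0" is adjoined above them.
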


We say that DBT reconstructs the binary logical
multicast tree $(V, L)$ if, given the receiver set $R$ and
the leaf mark distributions, it produces $(V, L)$
as its output. Clearly this happens if and only if
before each iteration of the while loop (line 3 in Figure 1),
$(V', L')$ can be decomposed in terms of disjoint
subtrees: $V' = \bigcup_{k \in R'} V(k)$, $L' = \bigcup_{k \in R'} L(k)$.
These subtrees may just be trivial ones $T(k) =
(\{k\}, \emptyset)$ comprising a root node $k$. We note also that
these trees cover $R$, i.e. $R = \bigcup_{k \in R'} R(k)$. These
properties hold before the first while loop, and hold
subsequently since each loop of a successful
reconstruction amalgamates binary subtrees rooted at
siblings.

Theorem 1  Let $T$ be a binary tree, equipped with
an estimable and increasing function $\varphi$. Then DBT
reconstructs $T$.

Proof of Theorem 1: Suppose the algorithm does
not reconstruct the tree. Then there must be an
iteration of the while loop for which $j$ and $k$ in line
4 of Figure 1 are not siblings. Consider $R', V'$ at
the start of the first loop in which this occurs. Let $\ell$ be
the sibling of $j$. $\ell \notin R'$ since $a(j, \ell) \prec a(j, k)$
implies $\Phi_{j,\ell} > \Phi_{j,k}$, contradicting the maximality
of $\Phi_{j,k}$. Since the subtrees comprising $(V', L')$
are disjoint, no ancestor of $j$ (or hence of $\ell$) can
lie in $R'$. Since the tree is binary, $\ell$ must have at
least two descendants $t_1, t_2$ in $R'$ since otherwise
$\bigcup_{r \in R'} R(r)$ would not cover $R$. Since $a(t_1, t_2) \preceq \ell$,
then $\Phi_{t_1,t_2} \ge \varphi(\omega_{p(\ell)}) > \varphi(\omega_{p(a(j,k))}) = \Phi_{j,k}$,
contradicting the maximality of $\Phi_{j,k}$. ■

Reconstruction of Binary Trees from Measurements.
Now we switch to the context in which a stream
of probes is dispatched from the source, each giving
rise to an independent realization of the mark
process. Let $x^{(i)}$ denote the $i$-th such realization.
Each realization gives rise to a set of measurements
$\{x^{(i)}_{p(k)} : k \in R\}$ at the leaves. Suppose that some
subtree $(V(k), L(k))$ of the tree is already identified.
Then we can aggregate the measured leaf marks
analogously to (2), defining $\tilde x^{(i)}_k = x^{(i)}_{p(k)}$ for $k \in R$,
and forming $\tilde x^{(i)}_k = \tilde x^{(i)}_{h(k)} \oplus \tilde x^{(i)}_{h^*(k)}$ by recursion for
$k \notin R$.
Let $\hat\omega^{(n)}_{j,k} = n^{-1} \sum_{i=1}^{n} \delta_{(\tilde x^{(i)}_j, \tilde x^{(i)}_k)}$ denote the empirical
distribution of $(\tilde x_j, \tilde x_k)$; here $\delta_y$ denotes the unit mass
at $y \in X^2$. We estimate $T$ by the topology $\hat T^{(n)}$
obtained by using the $\hat\omega^{(n)}_{j,k}$ in place of $\omega_{j,k}$ in the DBT
algorithm. Specifically, we use $\hat\Phi^{(n)}_{j,k} := \Phi(\hat\omega^{(n)}_{j,k})$ in
place of $\Phi_{j,k}$ in line 4 of Figure 1. We call the
resulting algorithm the Binary Tree Classification
Algorithm (BT).

Theorem 2  Under the conditions of Theorem 1, with
probability 1, $\hat T^{(n)} = T$ for sufficiently large $n$.
Hence $\hat T^{(n)}$ is a consistent estimator of $T$ and
$\lim_{n \to \infty} P[\hat T^{(n)} \ne T] = 0$.


Proof of Theorem 2: By the law of large numbers,
$\hat\omega^{(n)}_{j,k}$ converges weakly to $\omega_{j,k}$, almost surely. Since
$\varphi$ is weakly continuous each $\hat\Phi^{(n)}_{j,k}$ converges almost
surely to $\Phi_{j,k}$. Then, almost surely, for all
sufficiently large $n$, the relative ordering of the $\hat\Phi^{(n)}_{j,k}$ is
the same as that of the $\Phi_{j,k}$ for pairs $j, k$ for which
the $\Phi_{j,k}$ are distinct. Hence BT reconstructs the tree
in the same manner as DBT, except possibly varying
the order of grouping amongst sets of sibling pairs
$(j, k)$ with identical $\Phi_{j,k}$. The last two statements
then follow by standard results. ■

Characterizing Link Behavior. In many of the
examples of the next section $\varphi$ is additive over links,
i.e. $\varphi(\omega_1 * \omega_2) = \varphi(\omega_1) + \varphi(\omega_2)$. Then we can
ascribe a descriptor $\varphi_k$, such as a packet loss rate,
to each link $(f(k), k)$ through $\varphi_k = \varphi(\omega_{p(k)}) -
\varphi(\omega_{p(f(k))})$. (This may conveniently be done
during the execution of DBT or BT).

Extension to General Trees. Inference of general
trees is accomplished as follows. For simplicity
assume that $\varphi$ is additive. Then application of DBT to
an arbitrary tree results in a binary tree that contains
fictitious links $k$ such that $\varphi_k = 0$. The tree can then
be pruned by removing any such link and identifying
its endpoints. It can be shown that this procedure
yields the true general tree. In BT it is necessary
to apply a threshold $\varepsilon > 0$ and prune all links
$k$ with $\varphi_k \le \varepsilon$. This is because for finitely many
probes, statistical fluctuations lead the characteristic
$\varphi$ of the fictitious links to differ slightly from zero. It
can be shown that for sufficiently many probes, this
approach reconstructs any general tree for which all
$\varphi_k > \varepsilon$.

3  Instances of Topology Inference

In this section we specify instances of the framework
described above, specifying the setting (the marks
$X$, etc.) and the forms of the functions $\varphi$, $\Phi$ and
$\hat\Phi^{(n)}$. Theorems 1 and 2 then apply immediately in
each case.

3.1  Loss-Based Inference


In this case $\varphi$ is (a function of) the probability of
successful transmission from the root to a given node. In
the above formalism, we take $X = \{0, 1\}$, where 0
indicates packet loss and 1 transmission. Composition
is by taking minima, $x_1 \otimes x_2 = x_1 \wedge x_2$, and the
identity is $z = 1$; a packet is transmitted on a path if
it would be transmitted on all links of that path.
Aggregation is by taking maxima, $x_1 \oplus x_2 = x_1 \vee x_2$;
hence $\tilde x_k = 1$ if a packet reaches any receiver
descended from $k$. It can be shown [1] that

$$P_\omega[x_{p(a(j,k))} = 1] = \frac{P_\omega[\tilde x_j = 1]\, P_\omega[\tilde x_k = 1]}{P_\omega[\tilde x_j = \tilde x_k = 1]}. \qquad (3)$$

Using $Q = \mathbb{R}_+$, we define $\varphi$ to act on the generic
measure on $X$ as $\varphi((1 - \alpha)\delta_0 + \alpha\delta_1) = -\log(\alpha)$
for $\alpha \in [0, 1]$. Clearly $\varphi(\delta_z) = 0$. We write
$\Phi_{j,k} = \log P_\omega[\tilde x_j = \tilde x_k = 1] - \log P_\omega[\tilde x_j = 1] - \log P_\omega[\tilde x_k = 1]$.
$\varphi$ is additive over links, and the link
characteristic is $\varphi_k = -\log P_{\omega_k}[x_k = 1]$, i.e., the
negative log probability of successful transmission over
link $k$. Thus $\varphi$ is increasing provided the link loss
probabilities are strictly positive. For inference from
measurements,

$$\hat\Phi^{(n)}_{j,k} = \log n + \log\Big(\sum_{i=1}^{n} \tilde x^{(i)}_j \tilde x^{(i)}_k\Big) - \log\Big(\sum_{i=1}^{n} \tilde x^{(i)}_j\Big) - \log\Big(\sum_{i=1}^{n} \tilde x^{(i)}_k\Big),$$

where we have expressed the various probabilities in terms of the
empirical means.
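
The loss-based estimator can be illustrated with a short sketch. It assumes each node's aggregate marks are available as a list of 0/1 outcomes, one per probe; the function names `aggregate_loss` and `phi_hat_loss` and the sample data are hypothetical and chosen only for illustration.

```python
import math


def aggregate_loss(xj, xk):
    """Aggregate marks of a composite node: per-probe maximum of its two children."""
    return [max(a, b) for a, b in zip(xj, xk)]


def phi_hat_loss(xj, xk):
    """Empirical Phi for loss: log n + log(sum xj*xk) - log(sum xj) - log(sum xk)."""
    n = len(xj)
    s_j = sum(xj)
    s_k = sum(xk)
    s_jk = sum(a * b for a, b in zip(xj, xk))
    if min(s_j, s_k, s_jk) == 0:
        raise ValueError("estimator undefined: some count of successes is zero")
    return math.log(n) + math.log(s_jk) - math.log(s_j) - math.log(s_k)


# Eight probes observed at two receivers (made-up data).
xj = [1, 1, 1, 1, 0, 0, 0, 0]
xk = [1, 1, 1, 0, 1, 0, 0, 0]
phi = phi_hat_loss(xj, xk)
# exp(-phi) estimates the transmission probability on the shared path.
print(phi, math.exp(-phi))
```

With these counts (4 successes at each receiver, 3 joint successes, 8 probes), the estimated transmission probability on the shared path is (4/8)(4/8)/(3/8) = 2/3.
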


3.2  Delay  Covariance-Based  Inference 

In this case the increasing function $\varphi$ is the variance
of the cumulative delay from the root to a given node.
In the formalism, $X = \mathbb{R}$, with $x_k$ the delay
encountered on link $k$. (The formalism extends to loss by
using $x_k = \infty$ to denote loss; we treat this
elsewhere [4]). Composition adds delays along a path:
$x_1 \otimes x_2 = x_1 + x_2$. The identity is $z = 0$.
Aggregation takes the mean of two delays, $x_1 \oplus x_2 =
(x_1 + x_2)/2$. Taking $Q = \mathbb{R}_+$ and $\varphi(\omega) = \mathrm{Var}_\omega(x)$
we have

$$\varphi(\omega_{p(a(j,k))}) = \mathrm{Var}_\omega(x_{p(a(j,k))}) = \mathrm{Cov}_\omega(\tilde x_j, \tilde x_k) = \Phi_{j,k}. \qquad (4)$$

The middle equality holds since, by the
independence assumption, the only non-zero
contribution to $\mathrm{Cov}_\omega(\tilde x_j, \tilde x_k)$ is due to delays on the
common portion of the paths to $j$ and $k$. By
the independence assumption $\varphi$ is additive and
so $\varphi_k = \mathrm{Var}_\omega(x_k)$, the delay variance of link
$k$. $\varphi$ is increasing provided delays are not
constant. $\hat\Phi^{(n)}_{j,k}$ is the sample covariance, i.e.,

$$\hat\Phi^{(n)}_{j,k} = \frac{1}{n} \Big\{ \sum_{i=1}^{n} \tilde x^{(i)}_j \tilde x^{(i)}_k - \frac{1}{n} \sum_{i=1}^{n} \tilde x^{(i)}_j \sum_{i=1}^{n} \tilde x^{(i)}_k \Big\}.$$
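
A minimal sketch of the delay covariance estimator follows. It assumes a $1/n$ normalisation of the sample covariance (an unbiased $1/(n-1)$ variant would differ only by a constant factor), and the function names and sample delays are hypothetical.

```python
def aggregate_delay(xj, xk):
    """Aggregate marks of a composite node: per-probe mean of the two child delays."""
    return [(a + b) / 2 for a, b in zip(xj, xk)]


def phi_hat_delay_cov(xj, xk):
    """Sample covariance (1/n normalisation assumed) of two aggregate delay series."""
    n = len(xj)
    s_jk = sum(a * b for a, b in zip(xj, xk))
    return (s_jk - sum(xj) * sum(xk) / n) / n


# Four probes with made-up end-to-end delays at two receivers.
xj = [1, 3, 3, 5]
xk = [2, 2, 3, 5]
print(phi_hat_delay_cov(xj, xk))
print(aggregate_delay(xj, xk))
```

The per-probe mean in `aggregate_delay` is what makes this instance non-associative, which is why the aggregate marks are defined through the recursion (2) rather than as a single aggregate over all receivers.
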


and  A^Q^k)  (*)>  *  =  1,  •  •  • ,  d,  is  recursively  computed 
as  the  smallest  solution  of  the  following  quadratic 
equation: 


•n 


te{j,k} 


1  - 


+  EJnii  [1  -  Mi  -  m)]  -  l} 

+Mu,k)ii)  [1  -  MO)]  -  l}  =  0 

Where  /?,(*)  =  ^  ^ 


(6) 


{j,  k}. 


In  the  first  instance  we  can  take  Q  as  the  set  of 
ccdf’s  arising  from  the  delay  distrihutions.  Exclud¬ 
ing  from  Cl  trivial  distrihutions  in  which  all  link  de¬ 
lays  are  zero,  then  since  link  delays  are  non  negative, 
the  map  (p  taking  distrihutions  to  ccdf’s  is  increasing 
in  the  sense  of  Definition  l(i). 


3.3  Delay  Distribution-Based  Inference 


With inference supplying the full distribution of the cumulative delay from the root to a given node, there are several choices of the increasing function $\varphi$ available: the complementary cumulative distribution function (ccdf), the delay moments, and the delay variance.

In the formalism, $\mathcal{X} = \{q, 2q, \ldots, dq, \infty\}$, $q > 0$, $d \in \mathbb{N}$, with $x_k$ the delay on link $k$, discretized in bins of width $q$. $dq$ is a threshold delay above which packets are considered lost, $x_k$ taking the value $\infty$. Composition adds delays along a path: $x_1 \otimes x_2 = x_1 + x_2$. The identity is $z = 0$. Aggregation takes the minimum delay between paths, $x_1 \oplus x_2 = x_1 \wedge x_2$; hence $x_k = y$ means that the minimum delay from the source to some receiver descended from $k$ is $y$. In [5] it was shown how a generalization of the approach for loss inference in Section 3.1 above can be used to express the discretized distribution $\omega_{p(a(j,k))}$ of delay from the root to an interior node in terms of the distribution $\omega_{j,k}$ of aggregate delays to leaf nodes descended through offspring $j$, $k$. More precisely, denote $A_k(i) = P[x_{p(k)} = iq]$, $\gamma_k(i) = P[x_{p(k)} \le iq]$ and $\gamma_{j,k}(i) = P[x_{p(j)} \oplus x_{p(k)} \le iq]$, $i \in \mathcal{X}$. Then:


$A_{a(j,k)}(0) = \dfrac{\gamma_j(0)\,\gamma_k(0)}{\gamma_j(0) + \gamma_k(0) - \gamma_{j,k}(0)}$  (5)


In order to avoid comparing entire distributions, we can instead compare summary statistics. Since link delays are non-negative, any function of the form $\varphi(\omega) = E_\omega[h(x) \mid x < \infty]$ is estimable and increasing when $h$ is an increasing function, e.g., $h(x) = x^p$, $p > 0$. (Here $x$ represents a generic mark with distribution $\omega$.) A special case is the delay average estimator, obtained when $p = 1$. This is additive since the mean of the sum of two random variables is the sum of their means. Another estimator is the delay variance estimator: $\varphi(\omega) = \mathrm{Var}_\omega[x \mid x < \infty]$. This is additive due to the independence of link delays.
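As a minimal sketch of these summary statistics, the Python snippet below computes the mean and variance of a discretized delay distribution conditioned on the delay being finite. The function name `finite_delay_stats`, the convention that `probs[i]` is the mass at delay `i*q` starting from `i = 0`, and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
def finite_delay_stats(probs, q):
    """Mean and variance of a discretized delay, conditioned on a finite delay.

    probs[i] is the probability that the delay equals i*q (assumed layout).
    Any probability mass not covered by probs is treated as loss (infinite delay).
    """
    finite_mass = sum(probs)
    if finite_mass <= 0:
        raise ValueError("no finite-delay mass to condition on")
    # E[x | x < inf] and E[x^2 | x < inf], normalizing by the finite mass
    mean = sum(i * q * p for i, p in enumerate(probs)) / finite_mass
    second = sum((i * q) ** 2 * p for i, p in enumerate(probs)) / finite_mass
    return mean, second - mean ** 2


# Example: 10% of the mass is loss; the remaining 90% is spread over three bins.
mean, var = finite_delay_stats([0.5, 0.3, 0.1], q=1.0)
```

Normalizing by the finite mass is what implements the conditioning on $x < \infty$; without it, loss would bias both statistics downward.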

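For the delay average and variance classifiers, use

$\varphi_{j,k} = E_\omega[x_{p(a(j,k))} \mid x_{p(a(j,k))} < \infty] = \sum_{i=0}^{d} iq\, A_{a(j,k)}(i) \Big/ \sum_{i=0}^{d} A_{a(j,k)}(i)$

and

$\varphi_{j,k} = \mathrm{Var}_\omega[x_{p(a(j,k))} \mid x_{p(a(j,k))} < \infty] = \sum_{i=0}^{d} (iq)^2 A_{a(j,k)}(i) \Big/ \sum_{i=0}^{d} A_{a(j,k)}(i) - E_\omega^2[x_{p(a(j,k))} \mid x_{p(a(j,k))} < \infty]$,

respectively, where we condition on the delay being finite. The corresponding $\hat\varphi^{(n)}_{j,k}$ are computed using the estimated distribution $\hat{A}_{a(j,k)}(i)$, computed through (5) and (6) using the empirical frequencies $\hat\gamma_l(i)$ and $\hat\gamma_{j,k}(i)$, obtained by averaging the corresponding indicator functions over the probes $m = 1, \ldots, n$, in place of $\gamma_l(i)$, $l \in \{j, k\}$, and $\gamma_{j,k}(i)$, $i \in \mathcal{X}$.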
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3.4  Utilization-Based  Inference 

In this case the increasing function $\varphi$ is (a function of) the probability of encountering minimal delay at all links in a path. This case can be regarded as a degenerate case of the delay distribution inference in which $\mathcal{X} = \{0, 1\}$, where $x_k = 0$ indicates no delay on link $k$, while $x_k = 1$ corresponds to any non-zero queueing delay. Hence the fraction of packets that experience $x_k = 1$ is a direct measure of the utilization of link $k$. Since $x_{p(k)} = 0$ iff $x_j = 0$ for all $j$ in the path $p(k)$, the setup maps exactly onto the loss inference described in Section 3.1, except with the roles of 0 and 1 interchanged.

4  Misclassification  Analysis 


and  delay  covariance  case,  for  the  other  estimators  it 
follows  from  Theorem  3  in  [5]. 

Theorem  3  Under  the  conditions  of  Theorem  1,  for 
each  i  E  V\R,  ^  -  I,  k)  -  Di{j,  I,  k)), 

{j,l,k)  G  S(i),  where  Di(j,l,k)  = 
converges in distribution, as the number of probes $n \to \infty$, to a Gaussian random variable with mean 0 and variance $\sigma^2_{D_i}(j,l,k) = \lim_{n \to \infty} n \cdot$

(var(<|.;”')  +  Var(4“)  -  2Cov(<|.l”l, 

Theorem 3 suggests we can approximate $P[Q_i^c(j,l,k)]$ by a Gaussian tail probability, expressed through the cdf of the standard normal distribution. For large $n$, we have the following leading exponential approximation


In this section, we analyze the probabilities of misclassification for the various instances of BT, and estimate their convergence rates.

Denote by $E_i$ the event that BT has correctly reconstructed the subtree rooted at node $i$, $i \in V$. Since the algorithm proceeds iteratively up the tree, $E_i$ requires first that both the subtrees rooted at its child nodes have been correctly reconstructed, and then that its child nodes have been paired together. Therefore, for $i \in V \setminus R$ we can write

$E_i = E_{h(i)} \cap E_{h^*(i)} \cap \bigcap_{(l,j,k) \in S(i)} Q(l,j,k)$  (7)

where $S(i) = \{(h(i), h^*(i), k), (h^*(i), h(i), k) : i, k \neq a(i,k)\}$ and $Q(l, j, k)$ is the event that

(8) 

holds. In $\bigcap_{(l,j,k) \in S(i)} Q(l,j,k)$, $h(i)$ and $h^*(i)$ are grouped together to form node $i$ for all possible ways to reconstruct the tree. Denote by $E$ the event that the tree is correctly classified. From (7) we immediately have that $E \supseteq \bigcap_{i \in V \setminus R} \bigcap_{(l,j,k) \in S(i)} Q(l,j,k)$.

This provides the following upper bound for the misclassification probability, denoted by $P^f$:

$P^f := P[E^c] \le \sum_{i \in V \setminus R} \; \sum_{(l,j,k) \in S(i)} P[Q^c(l,j,k)]$  (9)

Normal  Approximations  It  can  be  shown  that 
has asymptotically Gaussian distribution as the number of probes $n \to \infty$ in all instances described in Section 3. See [2, 4] for the loss


$P[Q_i^c(j,l,k)] \approx e^{-(n/2)\, D_i^2(j,l,k) / \sigma^2_{D_i}(j,l,k)}$  (10)


where the exponent is given by the dominant term of the ratio. Since the largest term over $\bigcup_{i \in V \setminus R} S(i)$ in (9) should dominate for large $n$, we expect the curve $\log P^f$ vs. $n$ to be asymptotically linear with negative slope

$-\frac{1}{2} \inf_{i \in V \setminus R} \; \inf_{(j,l,k) \in S(i)} \frac{D_i^2(j,l,k)}{\sigma^2_{D_i}(j,l,k)}$  (11)


4.1  Misclassification  Probability  for  the  Dif¬ 
ferent  Classifiers 

The foregoing analysis can be instantiated with the different estimators to obtain the misclassification probability of the topology classifiers. In the following we compute the asymptotic behavior of the different classifiers by substituting the proper expressions in (11). In general, the calculation of the infimum in (11) is quite difficult since $\sigma^2_{D_i}(j,l,k)$ is a complex function of both the topology and the distribution $\omega$. Here, we capture the dominant modes of misclassification in the asymptotic regime of small loss and delay. The results in [2] and Theorem 3 in [5] suggest that in this regime, the curve $\log P^f$ vs. $n$ is asymptotically linear with negative slope

1.  for  the  loss  based  classifier 
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-  inf  P[xi  =  0]  (12) 
Figure 2: Model Simulation. Fraction of correctly classified topologies for different classifiers as a function of the number of probes: (a) light load scenario; (b) heavy load scenario.
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4.  for  the  variance  based  classifier 
5  Simulation Evaluation and Algorithm Comparison

In this section we compare the performance of the different classification algorithms through two types of simulation. In model simulations delay and loss are chosen to follow our statistical model, allowing us to test algorithm performance in the setting on which our analysis is based. Network simulations, using the ns [7] simulator, test the algorithms in a more realistic setting, where delay and loss are due to queueing delay and buffer overflows at nodes as multicast probes compete with background TCP/UDP traffic.


We  were  not  able  to  establish  an  analogous  re¬ 
sult  for  the  covariance  based  approach.  In  this  case 
we  used  our  experience  from  experiments  to  deter¬ 
mine  which  event  dominates  misclassification.  We 
observe  in  most  experiments  that,  as  n  increases,  the 
most  likely  way  to  misclassify  a  tree  is  by  incorrectly 
identifying  the  link  with  the  smallest  link  variance; 
this  happens  by  mistakenly  grouping  one  of  its  child 
nodes  with  its  sibling  node.  This  suggests  the  fol¬ 
lowing  approximation  for  the  covariance  approach 


-(n/2) 

e 


Var^  {x  j  ) 


(16) 


where  j  =  arg  minjgy\^^Var(a;j). 


5.1  Model  Simulation 

In the model simulations, at each link a probe is either lost, or encounters no delay, or suffers an exponentially distributed delay. We conducted 1000 simulations over randomly generated 15-node binary trees. In Figure 2 we plot the fraction of correctly classified topologies as a function of the number of probes for the different classifiers. We considered two regimes: a light load regime with low loss (1%) and utilization (randomly chosen between 10% and 40%), and a heavy load regime with higher loss (randomly chosen between 1% and 20%) and utilization (randomly chosen between 30% and 80%). In both cases, mean delays were randomly chosen between 0.2ms
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Figure  3:  ns  SIMULATION:  (a)  simulation  topology;  (b)  fraction  of  correctly  classified  topologies  for 
different  classifiers  as  function  of  the  number  of  probes 


and  2ms.  We  adopted  a  delay  granularity  of  1ms. 
(Although  this  seems  large  compared  with  mean  de¬ 
lays,  the  delay  predictions  are  quite  accurate;  see 
[5]). 
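The per-link behavior used in the model simulations (loss, no delay, or an exponentially distributed delay) can be sketched as below. This is an illustrative assumption about how such a draw might be coded, not the authors' simulator: the name `sample_link_delay`, the use of `utilization` as the probability of a non-zero delay given no loss, and the use of `mean_delay` as the mean of the exponential part are all hypothetical choices.

```python
import math
import random


def sample_link_delay(rng, loss, utilization, mean_delay):
    """Draw one per-link outcome for a probe (hypothetical helper).

    Returns math.inf if the probe is lost, 0.0 if it meets no queueing delay,
    and otherwise an exponentially distributed delay with the given mean.
    """
    if rng.random() < loss:
        return math.inf                      # probe lost on this link
    if rng.random() >= utilization:
        return 0.0                           # link idle: no queueing delay
    return rng.expovariate(1.0 / mean_delay) # busy link: exponential delay


rng = random.Random(1)
light_load_draws = [sample_link_delay(rng, 0.01, 0.25, 1.0) for _ in range(5)]
```

Representing loss as an infinite delay keeps one return type for all three outcomes, which matches the convention of marking lost probes with $\infty$.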

The loss based classifier is found to be most accurate in general. The exception is with small numbers of probes at small loss rates. Then rare losses provide insufficient data points, and accuracy is greater for the utilization and average delay classifiers. The utilization based classifier has the best accuracy among the delay distribution based classifiers. This was expected because the delay average and variance approaches use estimates of the entire delay distribution while the utilization approach uses estimates of only the first bin, and estimation of the weights of lower bins is found to be more accurate. Similarly, delay average is more accurate than delay variance since it attaches less weight to higher delay bins.

From the plots, the utilization based approach appears to work very well with low link utilizations, while its performance degrades with higher utilizations, which is in contrast with (13). This can be explained by observing that the theorem holds for the limit behavior as utilization goes to 0; in our experiments, we found that (13) captures well the misclassification behavior for utilization up to 20%. On the other hand, as link utilizations increase, the number of events used by the algorithm, namely those of minimum end-to-end delay, decreases rapidly. This results in increased estimator variance.


The two best performing algorithms (loss and utilization based) have the smallest computational complexity. All algorithms require $O((\#R)^2)$ node pair computations. Each of these is $O(n)$ for the loss, utilization and covariance based estimators. These are considerably more complex for the delay average and variance classifiers since the whole delay distribution must be calculated, by recursively solving quadratic equations, the number of which is inversely proportional to the bin size $q$.

5.2  TCP/UDP  Network  Simulation 

The ns simulations used the topology shown in Figure 3(a). To capture the heterogeneity between the edges and core of a WAN, interior links have higher capacity (5Mb/sec) and propagation delay (50ms) than those at the edge (1Mb/sec and 10ms). Each link is modeled as a FIFO queue with a 4-packet capacity.

The root node 0 generates probes as a 20Kbit/s stream comprising 40 byte UDP packets according to a Poisson process with a mean interarrival time of 16ms; this represents 2% of the smallest link capacity. The background traffic comprises a mix of infinite data source TCP connections (FTP) and exponential on-off sources using UDP. Averaged over the different simulations, the link loss ranges between 1% and 11% and link utilization ranges between 20% and 60%. The average delay ranges between 1 and 2ms for the slower links and between 0.2 and 0.5ms for the faster links. The delay distributions were computed using a bin size of 1ms.
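The probe-rate figures quoted above can be checked with a few lines of arithmetic. The constant names are illustrative; the values (40 byte packets, 16ms mean interarrival time, 1Mb/sec edge links) come from the text.

```python
PACKET_BYTES = 40               # probe size stated in the text
MEAN_INTERARRIVAL_S = 0.016     # 16 ms mean interarrival time
EDGE_CAPACITY_BPS = 1_000_000   # slowest (edge) link capacity, 1 Mb/sec

# Average probe bit rate and its share of the slowest link
probe_rate_bps = PACKET_BYTES * 8 / MEAN_INTERARRIVAL_S
load_fraction = probe_rate_bps / EDGE_CAPACITY_BPS
```

Under these values the probe stream averages about 20 kbit/s, that is, about 2% of the edge link capacity, in agreement with the stated figures.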

In Figure 3(b) we plot the fraction of correctly identified topologies over 100 simulations. The relative accuracy among the different classifiers is in agreement with the results from the model simulation, with the loss based algorithm having the best performance, with no misclassification for more than 500 probes. The rather poor performance of the delay based algorithms, with the exception of the utilization classifier, is largely due to the presence of spatial correlation. In our simulations, a multicast probe is more likely to experience a similar level of congestion on consecutive links or on sibling links than is dictated by the independence assumption. This has a negative impact on the accuracy of the delay estimates, which accounts for the observed performance.

We  also  observed  temporal  correlation  among  suc¬ 
cessive  probes  that  encountered  the  same  congestion 
events.  However,  it  can  be  shown  that  the  presence 
of  short-term  correlation  does  not  affect  estimator 
consistency,  although  the  convergence  rate  may  be 
slowed. 

6  Conclusions 

In this paper we have presented a general framework for the inference of multicast tree topologies from end-to-end measurements. In contrast with tools such as mtrace [6], cooperation of intervening network nodes is not required.

We specified an algorithm which reconstructs the topology of a multicast tree in the presence of any packet performance measure that: (i) monotonically increases as the packet traverses down the tree; and (ii) can be estimated on the basis of end-to-end measurements at the receivers. Building on previous results in [1, 4, 5], we were able to specify several instances of this algorithm based on performance measures such as packet loss, link utilization, delay average and delay variance.

We investigated the statistical properties of the algorithms, showed that, under mild assumptions, they are consistent, and computed their convergence rate. We evaluated our classifiers through simulation. We found that the two algorithms with the lowest computational complexity, namely the loss based and the utilization based algorithms, also have the best performance, with the loss based algorithm being in general the most accurate except when the number of probes and the loss rate are both small. Moreover, both algorithms seemed to be robust and to exhibit good convergence in real traffic simulations, in spite of the violation of the independence assumption of our model.

Finally,  the  algorithms  described  in  this  paper  are 
each  based  on  a  different  performance  metric.  We 
are  currently  extending  our  work  by  studying  algo¬ 
rithms  which  fully  take  advantage  of  the  available 
measurements  by  possibly  integrating  the  different 
performance  metrics  we  have  here  separately  consid¬ 
ered. 
Abstract — The use of end-to-end multicast traffic measurements has been recently proposed as a means to infer network internal characteristics such as packet link loss rate and delay. In this paper, we propose an algorithm that infers the multicast tree topology based on these end-to-end measurements. Unlike previous approaches, which make only partial use of the available information, this algorithm adaptively combines different performance measures to reconstruct the topology. We establish its consistency and evaluate its accuracy through simulation. We show that in general it requires many fewer probes to correctly identify the topology than other methods.

Keywords.  End-to-end  measurements,  Topology  Discovery, 
Adaptive,  Estimation  Theory,  Multicast  Tree. 

I.  Introduction 

Background and Motivation. As communication networks grow in size and complexity, it has become increasingly important to measure their performance. To overcome the limitations imposed by administrative diversity, which de facto prevents general direct access to large portions of the network, there has been increasing interest in approaches that aim to characterize the network internal behavior from external end-to-end measurements alone. Currently, there are several measurement infrastructure projects (including CAIDA [2], Felix [9], IPMA [10], NIMI [15] and Surveyor [18]) that collect and analyze end-to-end measurements across a mesh of paths between hosts.

In these approaches, a fundamental design issue is the type of measurements to be performed across the network and the methodology adopted to infer the internal network behavior in terms of the performance experienced by the measurement hosts. A promising approach, MINC (Multicast Inference of Network Characteristics), relies on the use of multicast end-to-end measurements. In contrast to unicast traffic, multicast traffic introduces a well structured correlation in the end-to-end behavior observed by the receivers that share the same multicast session. This in turn allows one to draw inferences about the performance characteristics of the internal links, such as packet loss rates [3], packet delay distributions [11], and packet delay variance [6], without the cooperation of network elements in the path. There is ongoing work [1] to incorporate some of these techniques into the NIMI measurement infrastructure.

All these inference methods require knowledge of the multicast tree topology. Unfortunately, this is typically unknown. This motivates the need for algorithms that can identify the topology of the tree. Another motivation is that knowledge of the multicast topology can be of use to multicast applications. There are several reliable multicast protocols (e.g., RMTP [14]) which organize receivers in logical hierarchies using the underlying topology, if possible. Other applications attempt to identify receivers that share the same network bottleneck [16].

* This work was supported in part by DARPA and the AFL under agreement

Several algorithms have been proposed for identifying multicast topologies based solely on loss observations at receivers. An algorithm for inferring the topology of a binary tree was first proposed in [16]. The main idea was the simple observation that, as the number of packets grows, multicast receivers sharing a longer portion of the multicast distribution tree also have higher shared loss rates; this information could in turn be used to reconstruct the topology by recursively grouping the pair of nodes with the highest shared loss. In [8] the correctness of this algorithm was proven and the approach was extended to general topologies by introducing several other loss-based algorithms. More recently, algorithms have been proposed for identifying multicast topologies based on delay measurements instead. By observing that the approach in [16] and [8] can be generalized to any performance measure that (i) monotonically increases as the packet traverses the tree, and (ii) can be estimated solely on the basis of end-to-end measurements at the receivers, in [7] several algorithms are specified based on delay performance measures such as link utilization, delay average and delay variance.

The accuracy of these approaches is limited by the fact that each of the above algorithms reconstructs the topology using only the information provided by one single performance measure, e.g., loss rates or delay averages, thus making only partial use of the available measurements. In addition, as shown in [7], no algorithm appears to perform better than the others in general. Our experience has shown that typically under moderate and heavy load network conditions (high link loss and utilization) the loss based algorithm is generally the most accurate, while under light load conditions (low link loss and utilization) the algorithm based on link utilization performs best. It is therefore not clear which algorithm is best suited to reconstruct multicast topologies across large internetworks, where different portions of the network can experience quite different conditions. In the most general case, the different algorithms could yield quite different reconstructed topologies; clearly, a method which allows one to choose among them, or better to compose them, is much desired.

Contribution. In this paper we propose a new algorithm for identifying multicast topologies based on joint loss and delay measurements at the receivers. This algorithm combines the different performance measures and reconstructs the tree by adaptively choosing, step by step, the measure which ensures the best accuracy. Intuitively, by so doing we compose the topologies each performance measure would yield by choosing for each portion of the tree its most accurate reconstruction.


The  key  contribution  underlying  this  approach  is  the  ability  to 
determine  which  performance  measure  minimizes  the  probabil¬ 
ity  of  making  an  error.  We  propose  a  technique  for  estimating 
the  probability  of  incorrect  identification  of  the  topology.  This 
is  accomplished  by  a  careful  enumeration  of  all  the  possible  er¬ 
roneous  decisions  and  by  estimating  the  probability  of  each  of 
them.  We  also  analyze  the  modes  of  misclassification  and  ver¬ 
ify  that  our  estimate  converges  to  the  true  error  probability  as 
the  number  of  packets  increases.  Therefore  we  can  use  this  esti¬ 
mate  to  determine  the  level  of  accuracy  of  a  given  reconstructed 
topology,  or  more  importantly,  the  number  of  probe  packets  re¬ 
quired  to  achieve  a  desired  level  of  accuracy. 

We  establish  that  the  joint  algorithm  is  consistent,  i.e.  the 
probability  of  correctly  identifying  the  topology  converges  to  1 
as  the  number  of  probes  grows  to  infinity.  Analysis  of  a  simple 
scenario  shows  that  the  joint  algorithm  can  significantly  outper¬ 
form  any  of  the  algorithms  previously  considered.  We  also  use 
simulation  to  evaluate  its  accuracy.  In  all  the  scenarios  consid¬ 
ered,  we  find  that  the  joint  algorithm  has  the  best  performance, 
requiring  in  general  many  fewer  probes  to  correctly  identify  the 
topology  than  other  methods. 

In this paper, we will restrict our attention to topology inference based solely on loss and utilization performance measures. A first reason is simplicity; as shown later, the loss process and the utilization process are formally identical once we substitute the event of “packet not lost” with the event of “packet not delayed”; as a consequence the very same results apply in both cases. A second reason is that they also have the lowest computational complexity. Finally, they are the most accurate: as previously mentioned, in most cases, either the loss based or the utilization based algorithm has the best performance. Hence, while the joint algorithm extends to accommodate other performance measures, in practice most of the benefit is achieved by combining the loss and utilization estimators.

Implementation Requirement. In contrast to loss, delay measurements require the deployment of measurement hosts with synchronized clocks. The Global Positioning System (GPS), which is used in some of the measurement infrastructures mentioned above, allows accuracy within tens of microseconds. This is sufficient for accurate utilization measurements which, in particular, require the accurate assessment of the minimum end-to-end delay. We believe this is not the case for the more widely deployed Network Time Protocol [12], which only provides accuracy on the order of tens of milliseconds.

Structure of the Paper. The rest of the paper is organized as follows. In Sections II and III we review our model and the loss and utilization topology inference algorithms. In Section IV we introduce the joint loss/utilization algorithm; we also describe the technique for estimating the probability of topology misclassification. In Section V we analyze the performance of the different algorithms. Their accuracy is then evaluated in Section VI through simulation. We conclude in Section VII; some proofs are deferred to the Appendix.

II.  Model  &  Inference 

Tree Model. The physical multicast tree comprises actual network elements (the nodes), and the communication links that join them. The logical multicast tree comprises the branch points of the physical tree, and the logical links between them. The logical links comprise one or more physical links. Thus each node in the logical tree, except the leaf nodes and possibly the root, must have 2 or more children. We can construct the logical tree from the physical tree by deleting all nodes with one child (except for the root) and adjusting the links accordingly by directly joining each deleted node's parent and child.
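The construction of the logical tree can be sketched as follows. This is an illustrative reading of the description above, assuming the physical tree is given as a parent map; the function name `logical_tree` and the dictionary representation are not from the paper.

```python
def logical_tree(parent, root=0):
    """Collapse a physical tree into its logical tree (illustrative sketch).

    parent maps each non-root node to its parent. Non-root nodes with exactly
    one child are removed, and their child is attached directly to the nearest
    remaining ancestor. Returns the parent map of the logical tree.
    """
    children = {}
    for node, par in parent.items():
        children.setdefault(par, []).append(node)

    logical = {}
    for node in parent:
        if len(children.get(node, [])) == 1:
            continue  # single-child node: not a branch point, so it is dropped
        anc = parent[node]
        # climb past single-child ancestors; the root is always kept
        while anc != root and len(children.get(anc, [])) == 1:
            anc = parent[anc]
        logical[node] = anc
    return logical


# Physical tree: 0 -> 1 -> 2, node 2 branches to 3 and 4, then 4 -> 5 -> 6
physical = {1: 0, 2: 1, 3: 2, 4: 2, 5: 4, 6: 5}
logical = logical_tree(physical)
```

In this example only the branch point 2 and the leaves 3 and 6 survive below the root, and each logical link stands for a chain of one or more physical links.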

Let $T = (V, L)$ denote a logical multicast tree with nodes $V$ and links $L$. We identify the root node 0 with the source of probes, and $R \subset V$ will denote the set of leaf nodes (identified as the set of receivers). The set of children of node $k \in V$ is denoted by $d(k)$. For each node $k$, other than the root 0, there is a unique node $f(k)$, the parent of $k$, such that $(f(k), k) \in L$. We will refer to the link $(f(k), k)$ as simply link $k$. We shall define $f^n(k)$ recursively by $f^n(k) = f(f^{n-1}(k))$ with $f^1 = f$. We say that $j$ is a descendant of $k$ if $k = f^n(j)$ for some integer $n > 0$, and write the corresponding partial order in $V$ as $j \prec k$. $a(i,j)$ will denote the minimal common ancestor of $i$ and $j$ in the $\prec$-ordering. For $k \in V$ we let $T(k) = (V(k), L(k))$ denote the subtree of $T$ that is rooted at $k$, and set $R(k) = R \cap V(k)$.
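The tree notation above maps directly onto a parent map. The sketch below, with hypothetical helper names, shows one way to compute the ancestor chain $f^n(k)$, the minimal common ancestor $a(i,j)$, and the receiver set $R(k)$; it assumes the root is the only node without an entry in `parent`.

```python
def ancestors(parent, k):
    """Return [k, f(k), f^2(k), ..., root] for the parent map f."""
    chain = [k]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def min_common_ancestor(parent, i, j):
    """a(i, j): the first node on i's path to the root that is also on j's path."""
    on_j_path = set(ancestors(parent, j))
    for node in ancestors(parent, i):
        if node in on_j_path:
            return node
    raise ValueError("nodes are not in the same tree")


def receivers_below(parent, receivers, k):
    """R(k): the receivers whose path to the root passes through k."""
    return {r for r in receivers if k in ancestors(parent, r)}


# Example tree: 0 -> 1; 1 -> {2, 3}; 2 -> {4, 5}; receivers are 3, 4, 5
tree = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
```

For instance, receivers 4 and 5 share the path down to node 2, while receivers 3 and 4 only share the path down to node 1.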

Delay and Loss Model. Probe packets are dispatched down the tree from the root node 0. With multicast, each probe arriving at a node $k$ gives rise to a copy sent to each child node of $k$. On each link, the packet is either lost, or transmitted with some delay. We regard the delay as the sum of two components: a fixed propagation delay, and a variable queueing delay. We represent the latter by a random variable $Z_k \in [0, \infty]$ that specifies the queueing delay encountered by a packet attempting to traverse link $k$, with $Z_k = \infty$ signifying packet loss. By convention $Z_0 = 0$. The accrued queueing delay for the path from the root to a node $k$ is $Y_k = \sum_{j \succeq k} Z_j$. This yields the property that $Y_k = \infty$ for a packet lost on some link between node 0 and $k$; likewise $Y_k = 0$ if no queueing delay is encountered on any link of the path.

Let $\alpha_l(k) = P[Z_k < \infty]$ denote the probability of transmission on link $k$, and $\alpha_u(k) = P[Z_k = 0]$ the probability of transmission with no queueing delay. A tree is said to be canonical if for all links $k$, $0 < \alpha_u(k) < \alpha_l(k) < 1$. A tree can be reduced to canonical form by (i) removing each link $k$ for which $\alpha_l(k) = 1$ or $\alpha_u(k) = 1$ and identifying its endpoints; and (ii) pruning all subtrees descended from links that have $\alpha_l(k) = 0$ or $\alpha_u(k) = 0$. Henceforth we work exclusively with canonical trees; only for these are the link characteristics uniquely identifiable.

Loss and Utilization Processes. Here it suffices to analyze a projection of the delay processes $Z_k$. For each $k \in V$ let $X_l(k) = \mathbf{1}\{Y_k < \infty\}$. We call $X_l = (X_l(k))_{k \in V}$ the loss process: $X_l(k) = 1$ if the probe reaches $k$ and 0 otherwise. For each $k \in V$ let $X_u(k) = \mathbf{1}\{Y_k = 0\}$. We call $X_u = (X_u(k))_{k \in V}$ the utilization process: $X_u(k) = 1$ if the probe reaches $k$ with no queueing delay, and 0 otherwise. The name arises since link queueing delay is zero iff the link is not utilized: $1 - \alpha_u(k)$ is hence the link utilization.
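To make the two projections concrete, the sketch below computes the accrued delay along a path and the corresponding loss and utilization indicators for one probe. The function names and the dictionary layout are assumptions for illustration; loss on a link is represented by `math.inf`, following the convention that an infinite queueing delay signifies packet loss.

```python
import math


def path_delay(parent, link_delay, k):
    """Accrued queueing delay from the root to node k for one probe.

    link_delay[j] is the delay on link j (math.inf if the probe is lost there).
    The root has no entry in parent and contributes no delay.
    """
    total = 0.0
    while k in parent:
        total += link_delay[k]
        k = parent[k]
    return total


def loss_indicator(y):
    """1 if the probe reaches the node (finite accrued delay), else 0."""
    return 1 if y < math.inf else 0


def utilization_indicator(y):
    """1 if the probe reaches the node with no queueing delay at all, else 0."""
    return 1 if y == 0 else 0


parent = {1: 0, 2: 1, 3: 1}
delays = {1: 0.0, 2: math.inf, 3: 0.5}   # one probe: lost on link 2, delayed on link 3
```

Here node 1 is reached with no delay (both indicators are 1), node 3 is reached but delayed (loss indicator 1, utilization indicator 0), and node 2 is not reached (both 0).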

We assume the $Z_k$ are independent random variables. Then $X_u$ and $X_l$ are Markov processes on $T$. Their structure is formally identical. The loss process satisfies

$X_l(0) = 1$;  $X_l(f(k)) = 0 \Rightarrow X_l(k) = 0$;  $P[X_l(k) = 1 \mid X_l(f(k)) = 1] = \alpha_l(k)$.  (1)

The utilization process is formally identical upon replacing the event of “no loss” with that of “no delay”. Then (1) holds when $X_l$, $\alpha_l$ are replaced by $X_u$, $\alpha_u$. In the rest of the paper we will omit the subscripts $l$ and $u$ when the same statement holds for both cases.


Inference of Shared Path Characteristics. When probes are sent down the tree we cannot observe the entire process $X$ but only the outcomes at the receivers $(X(k))_{k \in R}$. By exploiting the correlation of multicast traffic, in [3] it was shown how the link loss rates can be computed from the distribution of $(X(k))_{k \in R}$ when the topology is known. Here, to infer the topology, we will use the following generalization of the results in [3].

Let $A(k) = \prod_{j \succeq k} \alpha(j)$ denote the probability that a probe
reaches node $k$ (the $A_l(k)$ version) or reaches it without queueing
delay (the $A_u(k)$ version). A short probabilistic argument
shows that for any two nodes $i$ and $j$ with $k = a(i,j)$,

$$A(k) = A(i,j) := \frac{P[\vee_{l \in R(i)} X(l) = 1] \, P[\vee_{l \in R(j)} X(l) = 1]}{P[\vee_{l \in R(i)} X(l) = \vee_{l \in R(j)} X(l) = 1]}. \qquad (2)$$

Relation (2) expresses the behavior along the shared
portion of the path from the source to a pair of nodes in terms of
the probabilities of leaf-measurable events.
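As an illustration of how (2) can be evaluated from receiver observations, the following Python sketch replaces each probability in (2) with an empirical frequency over the dispatched probes. The data layout (one dictionary per probe, mapping a receiver label to a 0/1 outcome) and the function names are assumptions made for this example only; they are not taken from the paper.

```python
def reach_indicator(outcome, receivers):
    """Return 1 if at least one receiver in `receivers` observed the probe."""
    return int(any(outcome[r] for r in receivers))


def shared_path_estimate(outcomes, recv_i, recv_j):
    """Empirical counterpart of (2) for two disjoint receiver sets R(i), R(j).

    outcomes: list of dicts, one per probe, mapping receiver -> 0 or 1
              (hypothetical layout chosen for this sketch).
    Returns None when no probe was seen on both sides, since the ratio
    is then undefined.
    """
    n = len(outcomes)
    xi = [reach_indicator(o, recv_i) for o in outcomes]
    xj = [reach_indicator(o, recv_j) for o in outcomes]
    p_i = sum(xi) / n                                # frequency of "some receiver below i"
    p_j = sum(xj) / n                                # frequency of "some receiver below j"
    p_ij = sum(a * b for a, b in zip(xi, xj)) / n    # frequency of both events
    if p_ij == 0:
        return None
    return p_i * p_j / p_ij
```

For instance, if the two receivers always succeed or fail together, the shared path accounts for all of the loss and the estimate equals the common success frequency.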

To infer the probabilities from measurements, consider an
experiment in which a set of $n$ probes is dispatched from the
source. From the outcomes $(x^{(1)}, \ldots, x^{(n)})$, with
$x^{(m)} = (x^{(m)}(k))_{k \in R}$, we can estimate $A(k)$ by substituting the
probabilities in (2) by their empirical means, obtaining

$$\hat{A}^{(n)}(i,j) = \frac{\left(\sum_{m=1}^{n} X^{(m)}(i)\right) \left(\sum_{m=1}^{n} X^{(m)}(j)\right)}{n \sum_{m=1}^{n} X^{(m)}(i) \, X^{(m)}(j)}, \qquad (3)$$

where we define $X^{(m)}(k) := \vee_{l \in R(k)} x^{(m)}(l)$. It is possible to
show that $\hat{A}^{(n)} = (\hat{A}^{(n)}(i,j))_{i,j \in V}$ is consistent ($\hat{A}^{(n)} \to A$
with probability 1) and, as $n$ goes to infinity, $\sqrt{n}(\hat{A}^{(n)} - A)$
converges in distribution to a multivariate Gaussian random variable
with mean 0 and covariance matrix $\sigma_A = \sigma_A(A)$. Details can
be found in [8].

A complication arises in the case of utilization estimation, as we
have to account for (i) the presence of the fixed delay component
in the experimental data due to propagation delays and (ii)
the inherent limitation of time measurement accuracy due to
clock resolution. To this end, we (i) normalize each measurement
by subtracting the minimum delay seen at the leaf and (ii)
introduce a tolerance $\tau$ (typically smaller than 1ms) in deciding
whether a given delay is a "minimum" delay. In other words,
operationally we define
$x_u^{(m)}(k) = \mathbf{1}\{Y^{(m)}(k) - \min_{m'} Y^{(m')}(k) \le \tau\}$,
where $Y^{(m)}(k)$ is the delay experienced by the $m$-th probe sent
to receiver $k$. This amounts to assigning the observed minimum
delay as the propagation delay, under the assumption that at least
one probe has experienced no queueing delay along the path.

   1.  Input: The set of receivers $R = \{i_1, \ldots, i_r\}$
   2.  $R' := R$; $V' := R'$; $L' := \emptyset$;
   3.  while $|R'| > 1$ do
   4.      $U :=$ select pair;
   5.      $V' := V' \cup \{U\}$;
   6.      $L' := L' \cup \{(U, \ell) : \ell \in U\}$;
   7.      $\alpha(\ell) := A(\ell)/A(j,k)$, $\ell \in U$;
   8.      $R' := (R' \setminus U) \cup \{U\}$;
   9.  enddo
   10. $V' := V' \cup \{0\}$; $L' := L' \cup \{(0, R')\}$;
   11. Output: tree $(V', L')$;
   12. procedure select pair
   13.     return $U = \{j, k\} \subset R'$ with minimal $A(j,k)$;
   14. end procedure

Fig. 1. Deterministic Binary Tree Classification Algorithm (DBT).


III. Loss and Utilization Topology Inference

Deterministic Reconstruction of Binary Trees. Our approach
to loss (or utilization) topology inference relies on being able
through (2) to identify the characteristics along internal paths of
the multicast tree from the probability of measurable events at
receivers. The key observation is that $a(j,k) \prec a(j',k')$ implies
$A(j,k) < A(j',k')$, from which it follows that the pair
$\{j,k\} \subset R$ which has minimal $A(j,k)$ is a sibling pair; a short
argument shows that if not, $A(j,k)$ would not be minimal. The
idea is to proceed recursively, starting from the receivers, by
adding the parent node as siblings are identified. This approach
is formalized in the Deterministic Binary Tree Classification
Algorithm (DBT); see Figure 1.

DBT operates as follows. $R'$ denotes the current set of nodes
from which a pair of siblings will be chosen, initially equal to
the receiver set $R$. We first use the procedure select pair
to find the pair $U = \{j,k\}$ that minimizes $A(j,k)$ (line 4). This
identifies the members of $U$ as siblings, and the set $U$ is used
to represent their parent. Correspondingly, we add $U$ to the list
$V'$ of nodes (line 5), $(U,j)$, $(U,k)$ to the list $L'$ of links (line 6),
compute $\alpha(j)$ and $\alpha(k)$ by taking the appropriate quotient (line
7) and replace $j$ and $k$ by $U$ in the set $R'$ of nodes available for
pairing in the next stage (line 8). This process is repeated until
all sibling pairs have been identified (loop from line 3). Finally,
we adjoin the root node 0 and the link joining it to its single
child (line 10).
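The steps of DBT can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: internal nodes are represented as frozensets of their descendant receivers, the exact shared-path probabilities are supplied in two hypothetical lookup tables (`A_leaf` for single receivers and `A_pair` keyed by a frozenset of two receivers), receiver labels are assumed to be mutually comparable (for example strings), and the root is given the placeholder label `"root"` with path probability 1.

```python
from itertools import combinations

ROOT = "root"  # placeholder label for the root node 0 (assumption of this sketch)


def dbt(receivers, A_leaf, A_pair):
    """Group sibling pairs by minimal A(j, k), in the manner of Fig. 1.

    Returns (nodes, links, alpha); links are (parent, child) pairs and
    alpha maps each child node to its link probability.
    """
    node_A = {frozenset([r]): A_leaf[r] for r in receivers}  # node -> A(node)
    active = set(node_A)      # R': nodes still available for pairing
    nodes = set(active)       # V'
    links = []                # L'
    alpha = {}

    def pair_value(u, v):
        # With exact probabilities, A(u, v) is the same for any choice of
        # one receiver below u and one below v; use the smallest labels.
        return A_pair[frozenset((min(u), min(v)))]

    while len(active) > 1:
        candidates = combinations(sorted(active, key=sorted), 2)
        u, v = min(candidates, key=lambda p: pair_value(*p))   # line 4
        parent = u | v
        node_A[parent] = pair_value(u, v)
        nodes.add(parent)                                      # line 5
        for child in (u, v):
            links.append((parent, child))                      # line 6
            alpha[child] = node_A[child] / node_A[parent]      # line 7
        active -= {u, v}                                       # line 8
        active.add(parent)

    top = next(iter(active))
    nodes.add(ROOT)                                            # line 10
    links.append((ROOT, top))
    alpha[top] = node_A[top]   # A(root) = 1 is assumed here
    return nodes, links, alpha
```

On a three-receiver tree with uniform link probability 0.9, where receivers "4" and "5" are siblings, the sketch groups "4" and "5" first and then joins their parent with "3".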

We say that DBT reconstructs the binary logical multicast tree
$(V, L)$ if, given the receiver set $R$, it produces $(V, L)$ as its output.

Theorem 1: Let $T$ be a binary tree. Then DBT reconstructs $T$.

We  postpone  the  proof  to  the  Appendix. 

Reconstruction of Binary Trees from Measurements. It is
straightforward to derive from DBT an algorithm that estimates
the topology from the end-to-end measurements
$(x^{(1)}, \ldots, x^{(n)})$. The idea is to estimate $T$ by the topology
$\hat{T}^{(n)}$ obtained by using the estimates $\hat{A}^{(n)}(j,k)$ in place of $A(j,k)$.
This amounts to modifying the procedure select pair as follows

   procedure select pair
       return $U = \{j, k\} \subset R'$ with minimal $\hat{A}^{(n)}(j,k)$;
   end procedure

Computation of $\hat{A}^{(n)}(j,k)$ is accomplished via (3); to this end,
observe that $X^{(m)}(U) = X^{(m)}(j) \vee X^{(m)}(k)$, so they can be
recursively computed as the tree is reconstructed. It therefore suffices
to add the line

   4a. foreach $m = 1, \ldots, n$ do $X^{(m)}(U) := X^{(m)}(j) \vee X^{(m)}(k)$;

We call the resulting algorithm the Binary Tree Classification
Algorithm (BT).

Theorem 2: With probability 1, $\hat{T}^{(n)} = T$ for sufficiently
large $n$. Hence $\hat{T}^{(n)}$ is a consistent estimator of $T$, i.e.,
$\lim_{n \to \infty} P[\hat{T}^{(n)} \neq T] = 0$.

Proof of Theorem 2: Since $\hat{A}^{(n)}(j,k)$ converges almost
surely to $A(j,k)$, then, with probability 1, for all sufficiently
large $n$, the relative ordering of the $\hat{A}^{(n)}(j,k)$ is the same as that
of the $A(j,k)$ for pairs $j,k$ for which the $A(j,k)$ are distinct.
Hence, for all $n$ sufficiently large, BT reconstructs the tree in
the same manner as DBT, except possibly varying the order in
which it groups pairs $\{j,k\}$ with identical $A(j,k)$. The last two
statements then directly follow by standard results. $\blacksquare$

Finally, observe that in line 7 BT computes an estimate
$\hat{\alpha}^{(n)}(\ell) = \hat{A}^{(n)}(\ell)/\hat{A}^{(n)}(U)$ of $\alpha(\ell)$. From Theorem 2 it then
immediately follows that as $n$ goes to infinity $\hat{\alpha}^{(n)}(\ell)$ converges
with probability 1 to $\alpha(\ell)$.

Extension to General Trees. Inference of general trees can be
accomplished by extending BT. In [8] we propose and analyze
different alternatives. The simplest approach, which also turns
out to be the most computationally efficient and accurate,
proceeds in two steps: first it reconstructs a binary tree using BT;
then it applies a threshold $\varepsilon$ and prunes all links $k$ such that
$\hat{\alpha}^{(n)}(k) > 1 - \varepsilon$. The idea comes from the observation that
the application of DBT to an arbitrary tree results in a binary
tree in which all links $k$ which do not exist in the original tree
satisfy $\alpha(k) = 1$. In BT, the use of a threshold $\varepsilon$ accounts for
the statistical variability of the estimates.
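The pruning step can be sketched as follows. The tree encoding (a dictionary from each node to its list of children) and the function name are assumptions of this example, and the sketch only collapses links that end at internal nodes, which is a choice made here so that receivers are never merged into their parents; the text above does not spell out that restriction.

```python
def prune_binary_tree(children, alpha, eps, root):
    """Collapse internal links whose estimated alpha exceeds 1 - eps.

    children: dict node -> list of child nodes (receivers map to []).
    alpha:    dict node -> estimated probability of the link ending at node.
    Returns a new children dict in which each pruned node is removed and
    its children are attached to the nearest surviving ancestor.
    """
    def expand(node):
        kept = []
        for c in children[node]:
            if children[c] and alpha[c] > 1 - eps:
                kept.extend(expand(c))   # link into c is pruned: merge c upward
            else:
                kept.append(c)
        return kept

    pruned = {}
    todo = [root]
    while todo:
        node = todo.pop()
        pruned[node] = expand(node)
        todo.extend(pruned[node])
    return pruned
```

With a threshold of 0.05, a reconstructed binary tree whose internal link has estimate 0.99 collapses into a single node with three children, which is the expected outcome when the underlying tree is ternary at that node.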

IV.  A  Joint  Loss-Utilization  Algorithm 

We  now  extend  the  framework  for  topology  inference  by 
proposing  an  algorithm  which  combines  loss  and  utilization 
measurements.  We  contrast  this  to  BT  which  is  based  on  a  single 
performance measure. The idea consists in reconstructing the
topology by adaptively choosing at each step the performance
measure which ensures the best accuracy. We describe the
algorithm below. The algorithm bases its decisions on estimates
of the probability of misclassification. In the remainder of the
section we will present a technique for estimating this probability.

The Joint Loss-Utilization Classification Algorithm. The joint
algorithm proceeds like BT by recursively grouping nodes starting
from the set of receivers. Differently from BT, here we
choose at each step the performance measure on which to base
the grouping decision; more precisely, at each step we determine
the two pairs that minimize $\hat{A}_l^{(n)}(\cdot,\cdot)$ and $\hat{A}_u^{(n)}(\cdot,\cdot)$ and group
that which also minimizes the probability of making an error.
Specifically, we modify the procedure select pair as follows

   procedure select pair
       foreach $X \in \{l, u\}$
           select $U_X = \{j_X, k_X\} \subseteq R'$ with
               minimal $\hat{A}_X^{(n)}(j_X, k_X)$;
       return $U = \{j, k\} = \arg\min_{U_X,\, X \in \{l,u\}} \hat{P}^{f(n)}_{X,R'}$;
   end procedure

where $\hat{P}^{f(n)}_{X,R'}$ denotes the (estimated) probability of
misclassification, given the current set of nodes $R'$, pairing nodes
according to performance measure $X$. We will detail how to compute
this estimate in Section IV-A.

We call the resulting algorithm the Joint Binary Tree Classification
Algorithm (JBT). Denote by $\hat{T}_J^{(n)}$ the topology obtained by
JBT.

Theorem 3: With probability 1, $\hat{T}_J^{(n)} = T$ for sufficiently
large $n$. Hence $\hat{T}_J^{(n)}$ is a consistent estimator of $T$, i.e.,
$\lim_{n \to \infty} P[\hat{T}_J^{(n)} \neq T] = 0$.

We formalize the proof in the Appendix. The intuition behind
the proof is that, for all sufficiently large $n$, with probability
1, the relative ordering of the $\hat{A}^{(n)}(j,k)$ is the same as that of
the $A(j,k)$ (which, observe, can be different for loss and utilization),
from which it follows that the two pairs of nodes which minimize
$\hat{A}_l(\cdot,\cdot)$ and $\hat{A}_u(\cdot,\cdot)$ are both sibling pairs.

Extension to General Trees. Inference of general trees is
accomplished by reconstructing a binary tree using JBT first and
by then pruning all links $k$ such that $\hat{\alpha}_l^{(n)}(k) > 1 - \varepsilon_l$ and
$\hat{\alpha}_u^{(n)}(k) > 1 - \varepsilon_u$, where we use possibly different loss and
utilization thresholds, $\varepsilon_l$ and $\varepsilon_u$. The estimates $\hat{\alpha}_l^{(n)}(k)$ and
$\hat{\alpha}_u^{(n)}(k)$ are computed in line 7 of JBT by taking the appropriate
ratio.

A.  Estimation  of  the  Misclassification  Probability 

In  this  section  we  describe  the  estimate  of  the  probability  of 
misclassification  that  is  used  in  JBT.  Classification  proceeds  by 
a  sequence  of  comparison  operations;  the  analysis  of  misclassi¬ 
fication  is  therefore  potentially  complex  due  to  the  need  to  ana¬ 
lyze  a  large  number  of  statistically  dependent  modes  of  failure. 
Our approach to this is to divide and conquer. Correct classification
requires correct ordering of quantities $A(j,k)$ in a number
of comparisons. For each such comparison, we approximate the
probability  of  incorrect  ordering  in  terms  of  the  tail  probability 
of  a  Gaussian  random  variable  whose  variance  we  calculate.  For 
large  numbers  of  probes,  the  probability  of  misclassification  is 
dominated  by  the  largest  such  misordering  probability. 

The generic comparison involves three nodes $j$, $k$ and $l$, where
$a(j,k) \neq a(j,l)$. Since $a(j,k) \prec a(j,l)$ iff $A(j,k) < A(j,l)$,
the correct descendency relation between $a(j,k)$ and $a(j,l)$ is
identified if

$$\hat{D}^{(n)}(j,k,l) := \hat{A}^{(n)}(j,l) - \hat{A}^{(n)}(j,k) \qquad (4)$$

has the same sign as its deterministic counterpart $D(j,k,l) =
A(j,l) - A(j,k)$. Let $Q(j,k,l)$ denote this event.

The following theorem, essentially proved for loss-based
classification in [8], characterizes the asymptotic behavior of
$\hat{D}^{(n)}(j,k,l)$ first for large $n$, then for small loss and delays.
Denote $\bar{\alpha} = 1 - \alpha$ and let $s(k) := \sum_{j \succeq k} \bar{\alpha}(j)$.

Theorem 4: For each triple $(j,k,l)$, $\sqrt{n} \cdot (\hat{D}^{(n)}(j,k,l) -
D(j,k,l))$ converges in distribution, as the number of
probes $n \to \infty$, to a Gaussian random variable with mean 0 and
variance $\sigma^2(j,k,l)$. Moreover, as $\|\bar{\alpha}\| = \max_{k \in V} \bar{\alpha}(k) \to 0$:

(i) $D(j,k,l) = s(a(j,k)) - s(a(j,l)) + O(\|\bar{\alpha}\|^2)$;

(ii) $\sigma^2(j,k,l) = |s(a(j,k)) - s(a(j,l))| + O(\|\bar{\alpha}\|^2)$.
Measurements yield the statistic $\hat{D}^{(n)}(j,k,l)$ with which to
infer the descendency relations. From this we would infer
$a(j,k) \prec a(j,l)$ if and only if $\hat{D}^{(n)}(j,k,l) > 0$. Misordering
occurs when $\hat{D}^{(n)}(j,k,l)$ and $D(j,k,l)$ have opposite signs.
For large $n$, Theorem 4 suggests the following approximation
for the probability of misordering

$$P[Q^c(j,k,l)] \approx \Psi\left( -\sqrt{n} \, \frac{|D(j,k,l)|}{\sigma(j,k,l)} \right) \qquad (5)$$

where $\Psi$ is the cdf of the standard normal distribution. Since
$D(j,k,l)$ and $\sigma(j,k,l)$ are unknown, we need to estimate them
first. The idea is to simply estimate $D(j,k,l)$ by $\hat{D}^{(n)}(j,k,l)$.

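For the variance, we use the fact that $\sigma^2(j,k,l)$ is a continuous
function $V_{jkl}$ of $A$: for large $n$,
$\mathrm{Var}[\hat{D}^{(n)}(j,k,l)] = \mathrm{Var}[\hat{A}^{(n)}(j,l)] + \mathrm{Var}[\hat{A}^{(n)}(j,k)] - 2\,\mathrm{Cov}[\hat{A}^{(n)}(j,l), \hat{A}^{(n)}(j,k)]
\approx (\sigma_{A(j,l)(j,l)} + \sigma_{A(j,k)(j,k)} - 2\sigma_{A(j,l)(j,k)})/n = V_{jkl}(A)/n$,
so that $\sigma^2(j,k,l) = V_{jkl}(A)$, and we estimate it by
$\hat{\sigma}^{(n)2}(j,k,l) = V_{jkl}(\hat{A}^{(n)})$. We thus approximate the
probability of incorrect ordering $P[Q^c(j,k,l)]$ by

$$\hat{P}^{f(n)}_{jkl} := \Psi\left( -\sqrt{n} \, \frac{|\hat{D}^{(n)}(j,k,l)|}{\hat{\sigma}^{(n)}(j,k,l)} \right) \qquad (6)$$

where we used in place of $D(j,k,l)$ and $\sigma(j,k,l)$ their
estimates. The accuracy of (6) relies on the convergence of the
estimates $\hat{D}^{(n)}(j,k,l)$ and $\hat{\sigma}^{(n)}(j,k,l)$. We will verify this in
Section VI.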
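A plug-in estimate of this kind is straightforward to evaluate numerically. The sketch below computes the standard normal cdf from the complementary error function in Python's standard library and applies it to a normalized difference; the function names and argument names are illustrative choices, not notation from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def misordering_probability(d_hat, sigma_hat, n):
    """Gaussian tail approximation in the spirit of (6).

    d_hat:     estimated difference of the two shared-path probabilities.
    sigma_hat: estimated asymptotic standard deviation (must be positive).
    n:         number of probes.
    """
    if sigma_hat <= 0:
        raise ValueError("sigma_hat must be positive")
    return normal_cdf(-math.sqrt(n) * abs(d_hat) / sigma_hat)
```

The estimate equals 1/2 when the estimated difference is zero (the ordering is a coin flip) and decreases as either the number of probes or the normalized difference grows.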


Misclassification Probability Estimate. Consider now the $\ell$-th
step of JBT (or BT) and denote by $R_\ell$ the current set of nodes
and by $\{j_n, k_n\} \subset R_\ell$ the pair with minimal $\hat{A}^{(n)}(j_n, k_n)$. This
pair is chosen on the basis of the orderings $\hat{D}^{(n)}(j,k,l) > 0$
for each triple $(j,k,l) \in S(R_\ell) = \{(j_n,k_n), (k_n,j_n)\} \times
(R_\ell \setminus \{j_n,k_n\})$. With each such ordering we associate a
misordering probability as in (6). From the union bound
$P[\cup_{(j,k,l) \in S(R_\ell)} Q^c(j,k,l)] \le \sum_{(j,k,l) \in S(R_\ell)} P[Q^c(j,k,l)]$ we associate
with the selection of $(j_n,k_n)$ an estimated misclassification
probability through the sum

$$\hat{P}^{f(n)}_{R_\ell} := \sum_{(j,k,l) \in S(R_\ell)} \hat{P}^{f(n)}_{jkl} \qquad (7)$$

$$\approx \max_{(j,k,l) \in S(R_\ell)} \hat{P}^{f(n)}_{jkl} \qquad (8)$$

$$= \Psi\left( -\sqrt{n} \min_{(j,k,l) \in S(R_\ell)} \frac{|\hat{D}^{(n)}(j,k,l)|}{\hat{\sigma}^{(n)}(j,k,l)} \right). \qquad (9)$$

This is the misclassification estimate we use in JBT. The
approximation arises because for large $n$, the term with the smallest
argument $|\hat{D}^{(n)}(j,k,l)|/\hat{\sigma}^{(n)}(j,k,l)$ will dominate the rest.


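Observe that $\hat{P}^{f(n)}_{R_\ell}$, $\hat{D}^{(n)}(j,k,l)$ and $\hat{\sigma}^{(n)}(j,k,l)$ can be
directly computed from $\{\hat{A}^{(n)}(j,k) : j,k \in R_\ell\}$. Furthermore,
when selecting between the loss and utilization methods
during step $\ell$, we need only select that with the smallest
composite argument $-\sqrt{n} \min_{(j,k,l) \in S(R_\ell)} |\hat{D}^{(n)}(j,k,l)|/\hat{\sigma}^{(n)}(j,k,l)$.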
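To make the relation between the sum (7) and its dominant-term approximation (9) concrete, the sketch below evaluates both for a list of estimated triples and then selects between the loss and utilization measures. The input layout (a list of `(d_hat, sigma_hat)` pairs, one per triple in $S(R_\ell)$), the function names, and the rule that ties go to loss are assumptions of this example.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def step_estimate(triples, n):
    """Return (sum as in (7), dominant-term approximation as in (9)).

    triples: list of (d_hat, sigma_hat) pairs, one per compared triple.
    """
    ratios = [abs(d) / s for d, s in triples]
    total = sum(normal_cdf(-math.sqrt(n) * r) for r in ratios)
    dominant = normal_cdf(-math.sqrt(n) * min(ratios))
    return total, dominant


def choose_measure(loss_triples, util_triples, n):
    """Pick 'l' or 'u' by the smaller dominant-term estimate (ties go to loss)."""
    _, p_loss = step_estimate(loss_triples, n)
    _, p_util = step_estimate(util_triples, n)
    return "l" if p_loss <= p_util else "u"
```

Because the normal cdf is monotone, comparing the dominant terms is equivalent to comparing the smallest normalized differences, so the selection does not require evaluating the cdf at all; it is kept here only to mirror the formulas.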

Topology Misclassification Probability Estimate. (7) associates
a misclassification probability estimate with a single grouping
decision. Using a simple union bound argument, we can also
associate a misclassification probability estimate with the entire
reconstructed topology $\hat{T}^{(n)}$. In JBT, since we group the
pair of nodes which yields the smallest $\hat{P}^{f(n)}_{R_\ell}$, we can estimate
the topology misclassification probability by summing over the
minimum between the loss and utilization misclassification
estimates,

$$\hat{P}^{f(n)}_J := \sum_{\ell=1}^{|R|-1} \min\{\hat{P}^{f(n)}_{l,R_\ell}, \hat{P}^{f(n)}_{u,R_\ell}\}. \qquad (10)$$

We can also associate a misclassification
probability estimate to the topology inferred by BT. The
difference is that it is simply computed by summing over (7), i.e.,
$\hat{P}^{f(n)} := \sum_{\ell=1}^{|R|-1} \hat{P}^{f(n)}_{R_\ell}$. In Section VI we will illustrate
applications of these estimates.


V.  Analysis  of  Classifier  Performance 
A.  Performance  of  Single  Classifier  using  BT 

The analysis of the actual misclassification probabilities
mirrors much of the previous analysis. Consider a node $i \in V$
which is to be identified during the step $\ell$ of BT. Let $h(i)$
and $h^*(i)$ denote its two children. Correct identification of
$i$ occurs if neither $h(i)$ nor $h^*(i)$ is incorrectly paired with
some other element of $R_\ell$, the set of nodes available for
pairing at step $\ell$. Thus, the event of correct classification
at step $\ell$ is $Q_\ell = \cap_{(j,k,l) \in S(R_\ell)} Q(j,k,l)$ where $S(R_\ell) =
\{(h(i),h^*(i)), (h^*(i),h(i))\} \times (R_\ell \setminus \{h(i),h^*(i)\})$. Correct
classification of the whole tree is the event $Q = \cap_{\ell=1}^{|R|-1} Q_\ell$.

Now, the various $Q(j,k,l)$ are not independent events, and
neither are the $Q_\ell$. However, we can use union bounds to bound
above the probability of misclassification:

$$P^f := P[Q^c] \le \sum_{\ell=1}^{|R|-1} P^f_\ell, \quad \text{where} \qquad (11)$$

$$P^f_\ell := P[Q_\ell^c] \le \sum_{(j,k,l) \in S(R_\ell)} P[Q^c(j,k,l)]. \qquad (12)$$

According to Theorem 4, for large $n$ these sums will be
dominated by the expression $\Psi(-\sqrt{n\beta})$ where

$$\beta = \min_{\ell = 1, \ldots, |R|-1} \; \min_{(j,k,l) \in S(R_\ell)} \frac{D^2(j,k,l)}{\sigma^2(j,k,l)}. \qquad (13)$$

For large $n$, the approximation for $\log P^f$ is asymptotically
linear in $n$ with negative slope $\beta/2$. A simple approximation is
thus $P^f \approx e^{-n\beta/2}$.

If we consider the asymptotic regime of small loss and delay,
$\|\bar{\alpha}\| \to 0$, from relations (i) and (ii) in Theorem 4 it follows that

$$\min_{(j,k,l) \in S(R_\ell)} \frac{D^2(j,k,l)}{\sigma^2(j,k,l)} = \bar{\alpha}(i) + O(\|\bar{\alpha}\|^2), \qquad (14)$$

the minimum being attained, for small enough $\|\bar{\alpha}\|$, where
$a(j,k) = i$ and $a(j,l) = f(i)$. Picking the dominant contribution
to (11), then $\beta \approx \inf_{i \in V \setminus R} \bar{\alpha}(i)$, yielding
$P^f \approx e^{-n \inf_{i \in V \setminus R} \bar{\alpha}(i)/2}$.
Thus, in this regime, the probability of correctly identifying the
topology is controlled by the smallest loss rate or link utilization.

The above argument can be formalized using Large Deviation
theory. However, calculation of the decay rate appears
computationally infeasible, although the leading exponent
$\inf \bar{\alpha}(i)$ can be recovered in the small $\|\bar{\alpha}\|$ regime.
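The exponential approximation gives a quick, order-of-magnitude way to relate the decay rate to a probe budget: setting $e^{-n\beta/2} \le \delta$ and solving for $n$ gives $n \ge 2\ln(1/\delta)/\beta$. The short Python sketch below evaluates this bound; it is a back-of-the-envelope illustration only, since the approximation ignores all prefactors, and the function name and the example numbers are hypothetical.

```python
import math


def probes_needed(beta, delta):
    """Smallest integer n with exp(-n * beta / 2) <= delta.

    beta:  decay rate (for small loss and delay, roughly the smallest
           link loss rate or utilization, per the discussion above).
    delta: target misclassification probability, 0 < delta < 1.
    """
    if beta <= 0 or not 0 < delta < 1:
        raise ValueError("need beta > 0 and 0 < delta < 1")
    return math.ceil(2.0 * math.log(1.0 / delta) / beta)
```

For example, with a decay rate of 0.02 and a target of 0.05 the bound evaluates to 300 probes, and halving the decay rate doubles the requirement.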


Fig. 3. Three-receiver tree. Asymptotic slope of misclassification
probability for a single classifier, as a function of the uniform link
probability $\alpha$.

B.  Comparative  Performance  of  Loss  and  Utilization-Based 
Classifiers 

As an example we consider the three receiver tree with uniform
link probabilities $\alpha_l(k) = \alpha_l$ and $\alpha_u(k) = \alpha_u$; see
Figure 2. The topology is correctly inferred when nodes 4 and
5 are grouped together; this requires $\hat{A}^{(n)}(4,5) < \hat{A}^{(n)}(4,3)$
and $\hat{A}^{(n)}(5,4) < \hat{A}^{(n)}(5,3)$. The argument controlling the
misclassification probability is $\beta = D^2(4,5,3)/\sigma^2(4,5,3) =
D^2(5,4,3)/\sigma^2(5,4,3)$. We plot this as a function of the common
probability $\alpha$ in Figure 3. The curve is approximately linear
in $\bar{\alpha}$ for small $\bar{\alpha} = 1 - \alpha$, in agreement with (14). As $\bar{\alpha}$
increases, $\beta$ reaches a maximum at about $\bar{\alpha} = 0.2$ ($\alpha = 0.8$),
then decreases to 0. Thus in this homogeneous tree, the
misclassification probability is minimized when $\bar{\alpha} \approx 0.2$.

We compare the relative performance of the loss and utilization
classifiers in Figure 4, indicating the regions where each of
the relevant slopes is higher. The loss classifier is best
when loss rates are higher than about 0.2 (i.e., $\alpha_l < 0.8$) or
when utilization is high (i.e., low $\alpha_u$). However, it is
outperformed by the utilization classifier when there is low utilization
(i.e., high $\alpha_u$).

C. Performance of JBT

In this case, the analysis of the misclassification probability
is complicated by the fact that JBT uses the misclassification
estimates to take grouping decisions. Here, to illustrate its
modes of misclassification and assess its relative benefit with
respect to BT we analyze the performance of JBT in the three
receiver binary tree scenario in Figure 2 with uniform link
probabilities. In JBT, the topology is correctly inferred when for
the chosen performance measure $\hat{A}^{(n)}(4,5) < \hat{A}^{(n)}(4,3)$ and
$\hat{A}^{(n)}(4,5) < \hat{A}^{(n)}(5,3)$. To keep the complexity manageable,
we focus on the first event and assume misclassification occurs
when $\hat{A}^{(n)}(4,5) > \hat{A}^{(n)}(4,3)$, i.e., when $\hat{D}^{(n)}(4,5,3) < 0$.

Fig. 4. Three-receiver tree. Partition of parameter space $(\alpha_l, \alpha_u)$ where
loss or utilization estimator has better performance, i.e. largest asymptotic
slope for misclassification probability. Note $\alpha_l \ge \alpha_u$.

The behavior of the classifier is then completely characterized
by the bivariate random variable $(x_l^{(n)}, x_u^{(n)})$ where
$x^{(n)} = \hat{D}^{(n)}(4,5,3)/\hat{\sigma}^{(n)}(4,5,3)$. From (6), the misclassification estimate
for both performance measures is $\hat{P}^{f(n)}_{453} = \Psi(-\sqrt{n}\,|x^{(n)}|)$;
the joint algorithm groups the nodes based on loss information
when $|x_l^{(n)}| \ge |x_u^{(n)}|$ and on utilization otherwise (we assume
ties are resolved in favor of loss). Misclassification occurs when
the chosen performance measure results in grouping the wrong
pair; this happens when $|x_l^{(n)}| \ge |x_u^{(n)}|$ and $x_l^{(n)} < 0$ or when
$|x_u^{(n)}| > |x_l^{(n)}|$ and $x_u^{(n)} < 0$, which simply amounts to the
condition $x_l^{(n)} + x_u^{(n)} < 0$. The misclassification probability is then

$$P_J^f := P[x_l^{(n)} + x_u^{(n)} < 0]. \qquad (15)$$
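The reduction of the two-case misclassification condition to a single inequality on the sum can be checked by brute force. The Python sketch below encodes the decision rule for the three-receiver example (larger magnitude wins, ties go to loss) and compares it with the sign of the sum on a grid of values; the function name and the grid are choices made for this illustration, and the boundary event where the sum is exactly zero is excluded because the two descriptions can differ there.

```python
def jbt_misclassifies(x_l, x_u):
    """Three-receiver rule: group on the measure with larger |x| (ties: loss).

    The grouping is wrong exactly when the chosen statistic is negative.
    """
    chosen = x_l if abs(x_l) >= abs(x_u) else x_u
    return chosen < 0


# Compare the rule with the condition x_l + x_u < 0 on a grid of exact
# binary fractions, skipping the boundary x_l + x_u == 0.
grid = [i / 4 for i in range(-8, 9)]
mismatches = [
    (a, b)
    for a in grid
    for b in grid
    if a + b != 0 and jbt_misclassifies(a, b) != (a + b < 0)
]
```

The list `mismatches` comes out empty, which agrees with the claim that, away from ties, misclassification is equivalent to the sum of the two normalized statistics being negative.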

Normal Approximation. We now consider the asymptotic behavior
of $P_J^f$. An application of the Delta method (see Chapter 7 of
[17]) shows that as $n \to \infty$, $\sqrt{n}(x^{(n)} - x)$, where
$x = (x_l, x_u) = (\mathcal{H}(A_l), \mathcal{H}(A_u))$ and
$x^{(n)} = (x_l^{(n)}, x_u^{(n)})$, with continuous $\mathcal{H}$, converges in
distribution to a bivariate Gaussian random variable with mean zero
and covariance matrix $\sigma_x = (\nabla\mathcal{H}(A_l), \nabla\mathcal{H}(A_u)) \cdot \sigma_{A_l,A_u} \cdot
(\nabla\mathcal{H}(A_l), \nabla\mathcal{H}(A_u))^T$, where $\sigma_{A_l,A_u}$ is the asymptotic
covariance matrix of $\sqrt{n} \cdot (\hat{A}_l, \hat{A}_u)$ and $^T$ denotes the transpose.
($\sigma_{A_l,A_u}$ can be computed generalizing the approach used in [8]
to compute $\sigma_A$.)

Therefore, we have the following approximation

$$P_J^f \approx \exp\left( -\frac{n}{2} \inf_{x'_l + x'_u = 0} (x' - x) \cdot \sigma_x^{-1} \cdot (x' - x)^T \right) \qquad (17)$$

where for large $n$, we consider the leading exponential order.
The infimum in (17) is $x_J^2 = (x' - x) \cdot \sigma_x^{-1} \cdot (x' - x)^T$, where
$x' = (x'_l, x'_u) = (x'_l, -x'_l)$ is the tangent point between the
line $x_l + x_u = 0$ and the ellipse of the family $(x' - x) \cdot
\sigma_x^{-1} \cdot (x' - x)^T = a$ parameterized in $a$. Thus, as $n$ goes to
infinity we expect the curve $\log P_J^f$ vs. $n$ to be asymptotically
linear with negative slope $x_J^2/2$. A simple approximation is then
$P_J^f \approx e^{-n x_J^2/2}$. Moreover, the minimizing pair $(x_l^{(n)}, x_u^{(n)}) =
(x'_l, -x'_l)$ indicates that misclassification most likely occurs by
having the two estimated misclassification probabilities equal,
loss and utilization yielding two different pairs for grouping, and
picking the wrong pair.

Fig. 5. Joint Classifier. Contour plot of the ratio of the (log-scale)
misclassification probability asymptotic slope between the joint and best basic
classifier.

To illustrate the results, we study the relative performance of
JBT by comparing the asymptotic slope of the logarithm of the
misclassification probability $x_J^2$ with that of the best single
classifier. This is computed by considering the leading exponential
order approximation $P_X^f \approx \Psi(-\sqrt{n}\,x_X) \approx e^{-n x_X^2/2}$, $X \in \{l,u\}$, of
the misclassification probability in BT. Figure 5 shows the
contour plot of the ratio $x_J^2/\max\{x_l^2, x_u^2\}$ of the (log-scale) asymptotic
slopes as a function of the link characteristics $(\alpha_l, \alpha_u)$. JBT
performs better than either version of BT for a significant range
of values (the region within the contour line corresponding to
1). The performance improvement is more pronounced in the
region where the loss and utilization classifiers have similar
performance (which corresponds to the line separating the two
regions in Figure 4) and loss and utilization estimates have low
correlation (which occurs when $\alpha_l \gg \alpha_u$). This is not surprising
since we expect that: (i) little improvement can be achieved
when one classifier significantly outperforms the other; and (ii)
strong correlation offsets the benefits of using both loss and
utilization estimates.

To show the effect of correlation, consider the case $x_l = x_u$,
i.e., when the loss and utilization classifiers have the same
performance. In this case, it is easy to verify that $x_J^2 = \frac{2}{1+\rho} x_l^2$,
where $\rho$ denotes the coefficient of correlation of $x_l^{(n)}$ and $x_u^{(n)}$.
At one extreme, $\rho = 1$ and $x_J^2 = x_l^2$, i.e., $P_J^f = P_l^f$: we have
maximal correlation between the loss and utilization classifiers
and JBT cannot provide any performance improvement; at the
other extreme, $\rho = 0$ and $x_J^2 = 2x_l^2$, i.e., $P_J^f = P_l^f P_u^f$: we have
statistical independence and the probability of misclassification
is the product of the two misclassification probabilities.
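The dependence of $x_J^2$ on the correlation in the equal-performance case can be checked with a short constrained minimization. The derivation below assumes, for illustration only, that the two normalized statistics have unit asymptotic variance, so that $\sigma_x$ has ones on the diagonal and $\rho$ off the diagonal; the vector $u$ and the shorthand $y = x' - x$ are introduced here and are not notation from the paper.

```latex
\sigma_x = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
x = (x_l, x_l), \qquad u = (1, 1).

% Minimize the quadratic form over the line u \cdot x' = 0, with y = x' - x.
% For a positive definite \sigma_x, the minimum of y \sigma_x^{-1} y^T
% subject to u \cdot y = c equals c^2 / (u \sigma_x u^T).
x_J^2 = \min_{y \,:\, u \cdot y = -u \cdot x} \; y \, \sigma_x^{-1} y^{T}
      = \frac{(u \cdot x)^2}{u \, \sigma_x u^{T}}
      = \frac{(2 x_l)^2}{2 + 2\rho}
      = \frac{2}{1+\rho} \, x_l^2 .

% The two extremes quoted in the text:
\rho = 1 \;\Rightarrow\; x_J^2 = x_l^2, \qquad
\rho = 0 \;\Rightarrow\; x_J^2 = 2 x_l^2 .
```

Under these assumptions the gain of the joint classifier over a single classifier is a factor $2/(1+\rho)$ in the exponent, which shrinks to 1 as the two statistics become perfectly correlated.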

From Figure 5 we also observe that JBT does not always
provide better performance. In this example, under
very high or very low utilization the loss and utilization
classifiers, respectively, have better performance than JBT. In these
cases, because of the high variance of the misclassification
probability estimates, JBT is likely to mistakenly give preference
to the worse performance measure.

VI.  Experimental  Evaluation 

In  this  section  we  evaluate  the  performance  of  JBT  and  com¬ 
pare  it  with  that  of  BT  through  two  types  of  simulation.  In  model 
simulations  delay  and  loss  are  chosen  to  follow  our  statistical 
model,  allowing  us  to  test  algorithm  performance  in  the  setting 
on  which  our  analysis  is  based.  Network  simulations,  using  the 
ns  [13]  simulator,  test  the  algorithms  in  a  more  realistic  setting, 
where  delay  and  loss  are  due  to  queueing  delay  and  buffer  over¬ 
flows  at  nodes  as  multicast  probes  compete  with  background 
TCP/UDP  traffic. 

Model Simulation. We conducted 10000 experiments over
randomly generated 15 node binary trees. In Figure 6, we plot the
fraction of incorrectly classified topologies as a function of the
number of probes for the different classifiers. We considered
two regimes: a light load regime with low loss (randomly chosen
between 1% and 5%) and utilization (randomly chosen between
10% and 40%), and a heavy load regime with higher loss
(randomly chosen between 1% and 20%) and utilization (randomly
chosen between 30% and 80%).

In both cases, the joint classifier dramatically outperforms the
loss and utilization classifiers, with a difference in accuracy
already of more than one order of magnitude for just
400 probes.

The accuracy of our approach to joint classification lies in that
of the misclassification probability estimates. In Figure 6 we
also superimposed the mean over the experiments of the topology
misclassification probability estimates. From the figure, we
observe that the curves track the actual slopes well, bound
the actual values from above and preserve their relative order.

We can use the topology misclassification probability estimate
to determine the number of probes required to achieve a
desired level of accuracy of the inferred topology. The idea is to
proceed by dispatching probes until the estimated misclassification
probability is below a given threshold $\delta$ corresponding to a
desired level of accuracy. Thus, for example, to ensure a probability
of misclassification no greater than 0.05, we send probes
until $\hat{P}^{f(n)} < 0.05$.

Fig. 6. Model Simulation. Fraction of incorrectly classified topologies and misclassification estimates for different classifiers as function of number of probes:
(a) light load scenario; (b) heavy load scenario.

TABLE I
Accuracy of the inferred topology. Fraction of misclassified
topologies and average number of dispatched probes for
different values of $\delta$.

   JBT          $\delta$                     0.05    0.1     0.2
                fract. of mis. topologies    0.003   0.008   0.032
                average # of probes          145     117     86

   BT (loss)    $\delta$                     0.05    0.1     0.2
                fract. of mis. topologies    0       0       0.011
                average # of probes          415     318     240

Fig. 7. ns Simulation Topology.

We performed 1000 experiments over randomly generated 15
node binary trees. In each experiment probes were dispatched
until the misclassification probability estimate fell below a given
threshold $\delta$ and we verified whether the inferred topology was correct.
For JBT and BT under the light load regime, we summarize the
results in Table I where, for different values of $\delta$, we display the
average number of probes that were dispatched and the fraction
of topologies that were misclassified. Since the estimate bounds
the misclassification probability from above, it is no surprise that
the fraction of misclassified topologies is well below the chosen
threshold. Observe that the number of probes required by JBT
is about one third of those required by BT with loss.
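The stopping rule used in these experiments can be sketched as a simple loop. In the Python sketch below, `send_probe` and `estimate_error` are hypothetical hooks standing in for the measurement step and for the topology misclassification estimate; the cap `max_probes` is an added safeguard for the example and is not part of the procedure described above.

```python
def probe_until_confident(send_probe, estimate_error, delta, max_probes=100000):
    """Dispatch probes until the estimated misclassification probability < delta.

    send_probe:     callable returning the outcome of one probe
                    (hypothetical measurement hook).
    estimate_error: callable taking the list of outcomes collected so far
                    and returning the current misclassification estimate
                    (hypothetical hook).
    Returns (outcomes, reached), where `reached` tells whether the target
    was met before the probe budget ran out.
    """
    outcomes = []
    while len(outcomes) < max_probes:
        outcomes.append(send_probe())
        if estimate_error(outcomes) < delta:
            return outcomes, True
    return outcomes, False
```

In practice the estimate would be recomputed incrementally rather than from the full list at every step; the version above favors clarity over efficiency.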

Finally, to illustrate the benefit of combining loss and utilization
measurements we compare JBT with a simpler approach
which simply consists in choosing, among the inferred topologies
separately computed with the loss and utilization classifiers,
that with the smallest misclassification probability estimate.
Denote by $\hat{T}_X^{(n)}$ the topology inferred by classifier $X$, $X \in \{l, u\}$, and
by $\hat{P}_X^{f(n)}$ its estimated probability of misclassification. We select
$\hat{T}_{best}^{(n)} = \hat{T}_Y^{(n)}$ where $Y = \arg\min_{X \in \{l,u\}} \hat{P}_X^{f(n)}$. In Figure 6
we also superimposed the fraction of times $\hat{T}_{best}^{(n)}$ was incorrect.
This approach yields more accurate results than either the loss or the
utilization classifier, yet not as accurate as JBT; the distance
from the JBT curve quantifies the significant gain achievable by
the adaptive scheme which uses both performance measures; the
fact that the two curves are parallel suggests that misclassification is
ultimately dominated by the same event in both cases.

TCP/UDP Network Simulation. The ns simulations used the
topology shown in Figure 7. We arranged for some heterogeneity
with the interior links having higher capacity (5Mb/sec) and
propagation delay (50ms) than those at the edge (1Mb/sec and 10ms).
Each link is modeled as a FIFO queue with a 20-packet buffer
capacity.

The  root  node  0  generates  probes  as  a  20Kbit/s  stream  com¬ 
prising  40  byte  UDP  packets  according  to  a  Poisson  process  with 
a  mean  interarrival  time  of  16ms.  The  background  traffic  com¬ 
prises  a  mix  of  infinite  data  source  TCP  connections  (FTP)  and 
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currently  investigating  the  effect  of  correlation  on  the  accuracy 
of  topology  inference  algorithms;  this  is  part  of  a  more  general 
effort  to  characterize  network  traffic  correlation  and  its  effects 
on  end-to-end  measurements  based  inference. 

Acknowledgment.  We  thank  Don  Towsley  for  useful  comments 
and  suggestions. 

Appendix 

The  proof  of  Theorem  1  is  based  on  the  following  result.  We 
will  find  it  useful  to  identify  a  subset  5  of  y  as  a  stratum  if 
{R{k)  :  k  €  S}  is  a  partition  of  R. 

Lemma  1:  Let  5  be  a  stratum.  Then, 

(i)  a  pair  of  nodes  {i,  fc}  C  S  are  siblings  if  and  only  if 


Fig.  8.  ns  Simulation.  Fraction  of  incorrectly  classified  topologies  for  dif¬ 
ferent  classifiers  as  function  of  the  number  of  probes. 


A{j,  k)  < 


mm 

{f  ,k'}cS:\{j'  ,k'}n{j,k}\=l 


A{j',k'y,  (18) 


exponential  on-off  sources  using  UDP.  Averaged  over  the  differ¬ 
ent  simulations,  the  link  loss  ranges  between  1%  and  13%  and 
link  utilization  ranges  between  10%  and  88%. 

Figure  8  plots  the  fraction  of  incorrectly  identified  topologies 
over  100  simulations.  The  relative  accuracy  among  the  different 
classifiers  is  in  good  agreement  with  the  results  from  the  model 
simulations.  Performance  of  the  utilization  and  joint  classifiers 
are  somewhat  inferior  due  to:  (i)  wide  spread  of  link  utilization 
values  among  the  different  links;  (ii)  presence  of  spatial  corre¬ 
lation  among  probe  delays.  In  the  simulations,  probes  are  more 
likely  to  experience  similar  level  of  congestion  on  consecutive  or 
sibling  links  than  dictated  by  the  modes  independence  assump¬ 
tion.  We  calculated  the  off-diagonal  elements  of  the  correlation 
matrix  of  the  actual  link  delays.  The  mean  was  0.021  and  the 
maximum  0.17.  Despite  correlation  affected  its  accuracy,  JBT 
shows,  albeit  reduced,  performance  gain  over  BT. 

In  the  simulations  we  also  observed  the  presence  of  short¬ 
term  temporal  correlation  among  successive  probes  that  encoun¬ 
tered  the  same  congestion  events.  This  does  not  affect  estimator 
consistency,  although  the  convergence  rate  may  be  slowed. 

VII.  Conclusions 

In  this  paper  we  have  presented  an  algorithm  for  the  inference 
of  the  multicast  tree  topology  from  end-to-end  measurements. 
The  algorithm  combines  different  performance  measures  and  re¬ 
construct  the  tree  by  adaptively  choosing  that  which  insures  the 
best  accuracy.  This  is  accomplished  by  a  careful  enumeration  of 
all  the  possible  erroneous  decisions  and  by  estimation  of  their 
probability.  These  estimates  in  turn  can  be  used  to  determine 
the  number  of  probe  packets  to  achieve  a  desired  level  of  accu¬ 
racy. 

We  investigated  the  statistical  properties  of  the  algorithm  and 
showed  that  it  is  consistent.  Analysis  of  a  simple  scenario 
showed  that  it  can  significantly  outperform  any  of  the  algorithms 
previously  considered.  We  also  used  simulation  to  evaluate  its 
accuracy  and  found  out  that,  in  general,  it  required  many  fewer 
probes  to  correctly  identify  the  topology  than  other  approaches, 
ns  experiments  showed  that  spatial  correlation  negatively  af¬ 
fects  its  accuracy.  We  believe  that  diversity  of  traffic  in  real  net¬ 
works  makes  large  and  long  lasting  correlation  unlikely.  We  are 


(ii) if $\{j, k\} \subset S$ are such that

$$A(j,k) = \min_{\{j',k'\} \subset S} A(j',k') \qquad (19)$$

then $\{j, k\}$ are siblings;

(iii) if $\{j, k\} \subset S$ is a pair of sibling nodes, then
$(S \setminus \{j, k\}) \cup \{a(j,k)\}$ is a stratum.

Proof: Observe first that by definition of stratum, if $j \in S$, then
no ancestor or descendant of $j$ can belong to $S$. (i) The only if
part follows from the observation that if $j$ and $k$ are siblings,
then $a(j,k) \prec a(j,\ell), a(\ell,k)$ for any
$\ell \in S \setminus \{j, k\}$, which implies
$A(j,k) < A(j,\ell), A(\ell,k)$. For the if part, assume that
$\{j, k\} \subset S$ satisfies (18) and suppose $j$ and $k$ are not
siblings. Let $\ell$ be the sibling of $j$. Then $\ell \notin S$
since, if $\ell \in S$, $a(j,\ell) \prec a(j,k)$ implies
$A(j,\ell) < A(j,k)$, contradicting (18). Thus, since $S$ is a
stratum, there is a set of nodes
$T = \{t_1, \ldots, t_m\} \subseteq V(\ell) \cap S$ such that
$\bigcup_{t \in T} R(t) = R(\ell)$, since otherwise
$\{R(k) : k \in S\}$ would not cover $R$. Now either $k \in T$ or
$k \notin T$. But $k \in T$ implies that $a(t,k) \prec a(j,k)$,
$t \in T$, so that $A(t,k) < A(j,k)$, contradicting (18), while
$k \notin T$ implies that $a(j,t) \prec a(j,k)$, $t \in T$, so that
again $A(j,t) < A(j,k)$ contradicts (18). Therefore $j$ and $k$ are
siblings. (ii) is then an immediate consequence of (i), and (iii)
follows immediately from the definition of stratum. ■

Proof of Theorem 1. It suffices to observe that in DBT, at the
beginning of each iteration, $R'$ is a stratum; therefore, the pair of
nodes which minimizes $A(\cdot,\cdot)$ is always a pair of sibling
nodes. This property holds before the first loop ($R$ is a stratum),
and (ii) and (iii) of Lemma 1 ensure it holds subsequently. ■

Proof of Theorem 3. Since $\hat{A}^{(n)}(j,k)$ converges almost surely
to $A(j,k)$, then, with probability 1, for all sufficiently large $n$,
the relative ordering of the $\hat{A}^{(n)}(j,k)$ is the same as that
of the $A(j,k)$ (which, observe, can be different for loss and
utilization). Then, it suffices to observe that for all sufficiently
large $n$, the two pairs of nodes which minimize $A_l(\cdot,\cdot)$
and $A_u(\cdot,\cdot)$ are both sibling pairs provided $R'$ is a
stratum. This property holds before the first loop ($R$ is a
stratum), and (iii) of Lemma 1 ensures it holds subsequently,
irrespective of the actual pair of nodes selected for grouping. The
last two statements then follow directly from standard results.

Abstract. In this paper we explore the use of end-to-end unicast
traffic as measurement probes to infer link-level loss rates. We
build on earlier work that produced efficient estimates for
link-level loss rates based on end-to-end multicast traffic
measurements. We design experiments based on the notion of
transmitting stripes of packets (with no delay between transmission
of successive packets within a stripe) to two or more receivers. The
purpose of these stripes is to ensure that the correlation in
receiver observations matches as closely as possible what would have
been observed if the stripe had been replaced by a notional multicast
probe that followed the same paths to the receivers. Measurements
provide good evidence that a packet pair to distinct receivers
introduces considerable correlation, which can be further increased
by simply considering longer stripes. We then use simulation to
explore how well these stripes translate into accurate link-level
loss estimates. We observe good accuracy with packet pairs, with a
typical error of about 1%, which decreases significantly as stripe
length is increased to 4 packets.

I.  Introduction 

A.  Motivation 

As the Internet grows in size and diversity, its internal performance
becomes ever more difficult to measure. Any one organization has
administrative access to only a small fraction of the network's
internal nodes, whereas commercial factors often prevent
organizations from sharing internal performance data.

One promising technique that avoids these problems, Multicast
Inference of Network Characteristics (MINC), uses end-to-end
multicast measurements to infer link-level loss rates and delay
statistics by exploiting the inherent (and well characterized)
correlation in performance observed by multicast receivers. These
measurements do not rely on administrative access to internal nodes
since the inference can be calculated using only information recorded
at the end hosts.

The key intuition for inferring packet loss is that the arrival of a
packet at a given internal node can be directly inferred from the
packet's arrival at one or more receivers reached from the source by
paths through that node; if it makes it to the receivers, it must
have made it to the node. Conditioning on arrival at a descendant, we
can determine the probability of successful transmission to and
beyond the given node. Efficient inference algorithms are given in
[2] for loss, [15] for delay distributions, [7] for delay variances,
and [3] for inferring the logical multicast tree topology itself.

Although significant advances have been made in the use of multicast
measurements for inferring internal network behavior, the approach
suffers from two serious deficiencies. First, there remain
significant portions of the Internet that do not support
network-level multicast. Second, the internal performance observed by
multicast packets often differs significantly from that observed by
unicast packets. This is especially serious given that unicast
traffic constitutes far and away the largest portion of the traffic
on the Internet. Thus there is a need for techniques based on
end-to-end unicast measurements. This poses a significant challenge
because unicast measurements do not exhibit the well-behaved
correlation exhibited by multicast. Thus, the challenge addressed in
this paper is that of developing unicast-based measurement techniques
that create sufficient correlation to yield fruitful inference.

B.  Contribution 

In this paper we adapt the multicast inference techniques proposed in
[2] to perform inference of internal network characteristics from
unicast end-to-end measurements. The data for the inference comprises
measured end-to-end loss of unicast probes sent from a source to a
number of destinations. This is used to infer the loss and delay
characteristics of each logical link of the source tree joining the
source to the destinations, i.e., of the composite paths between its
branch points.

The idea is to construct composite probes of unicast packets whose
collective statistical properties closely resemble those of a
multicast packet. We shall speak of striping a group of unicast
packets across a set of destinations. This entails dispatching the
packets back-to-back from a source, each packet potentially having a
different destination address. Our premise is that when the duration
of network congestion events exceeds the temporal width of the
stripe, packets should have very similar experience of the network
upon traversing common portions of the paths to their destinations.
If the experiences were identical, the packets from a stripe that
attempt to traverse a given link would either all be lost, or
encounter identical delay. Hence the packet loss and delays on a
given link would be perfectly correlated within a stripe; the
composite probe would have the same statistical properties as a
notional multicast packet that followed the same source tree. In this
case the methods of [2], [7], [15] could be applied immediately to
infer the per link loss and delay statistics of the logical source
tree.

However,  correlations  within  stripes  may  be  less  than 
perfect  in  practice.  This  is  because  congestion  events  may 
not  affect  packets  uniformly,  subjecting  stripes  to  disper¬ 
sion  as  they  travel  through  a  network.  Some  mechanisms 
by  which  this  can  happen  are  the  following.  Packet  loss 
will  not  be  uniform  during  loss  events  that  are  narrower 
than  the  stripe,  or  those  that  start  or  stop  while  the  stripe  is 
in  progress.  Furthermore,  delays  will  vary  due  to  interleav¬ 
ing  of  background  traffic,  e.g.,  when  moving  from  a  low 
to  a  high  capacity  link.  Although  such  effects  should  be 
small  for  sufficiently  narrow  stripes,  they  will  be  cumula¬ 
tive.  Packet-dropping  on  the  basis  of  Random  Early  Detec¬ 
tion  (RED)  [9]  is  another  mechanism  by  which  packet  loss 
may  become  decorrelated.  It  remains  to  be  seen  whether 
this  mechanism  will  be  widely  deployed  in  communica¬ 
tions  networks.  On  the  other  hand,  the  use  of  RED  to 
merely  mark  packets  will  not  break  correlations. 

This  motivates  four  strands  of  work  in  this  paper: 

(i)  determining  the  magnitude  of  imperfect  correlations 
through  experiments  on  real  networks; 

(ii)  calculating  their  likely  impact  on  the  accuracy  of  in¬ 
ference  methods  that  assume  perfect  correlations; 

(iii) adopting measurement procedures that reduce the impact of
imperfect correlations;

(iv)  verifying  the  accuracy  of  the  approach  in  simulations. 

We  extend  the  packet  loss  model  of  [2]  by  incorporat¬ 
ing  an  additional  parameter  for  each  link  that  describes  the 
correlation  of  loss  between  different  packets  of  the  same 
stripe.  This  is  done  for  binary  stripes,  i.e.,  those  compris¬ 
ing  two  packets  with  different  destination  addresses.  These 
additional  parameters  cannot  themselves  be  determined  by 
end-to-end  measurements,  at  least  not  without  additional 
assumptions  relating  them  to  each  other,  or  to  the  existing 
loss  rate  parameters.  These  calculations  show  that  the  er¬ 
ror  in  using  the  loss  estimator  from  [2]  is  small  provided 
that  the  conditional  probability  of  loss  of  one  packet  in  the 


stripe  given  transmission  (i.e.,  non-loss)  of  the  other,  is 
small  compared  with  the  marginal  loss  rate  in  the  stripe. 
This  is  a  condition  that  we  will  verify,  at  least  for  end-to- 
end  paths,  through  measurement. 

By  constructing  appropriate  stripes  of  composite  probes 
and  selecting  subsets  of  these  probes  for  inference,  we  are 
able  to  enhance  correlations  within  data  used  for  inference. 
This  is  possible  when  packet  transmissions  are  correlated 
in  the  sense  that  a  given  packet  in  a  stripe  is  more  likely 
to  be  transmitted  across  a  given  link  when  other  pack¬ 
ets  within  the  stripe  are  known  to  have  been  transmitted 
across  the  link.  By  conditioning  on  the  measurable  event 
that  nearby  packets  have  been  transmitted  end-to-end,  we 
can  raise  the  likelihood  of  transmission  of  a  given  packet  to 
an  intermediate  node  closer  to  one.  By  sending  the  stripe 
packets  to  diverse  addresses,  we  can  infer  the  properties  of 
internal  network  paths  from  the  measurements. 

The  rest  of  the  paper  is  as  follows.  In  Section  II  we  for¬ 
mulate  the  stripe  method,  first  for  the  binary  tree  of  depth 
two,  and  then  for  general  trees.  We  specify  a  family  of 
different  striping  methods.  We  specify  the  required  cor¬ 
relation  assumption  between  packet  transmissions  within 
stripes,  and  show  that  it  can  be  used  to  construct  a  hierar¬ 
chy  amongst  the  various  striping  methods;  in  particular  we 
establish  an  order  relation  for  the  degree  of  correction  each 
method  gives  to  the  bias  caused  by  imperfect  correlations. 

We use two experimental approaches to evaluate the proposed method.
In Section III we use end-to-end measurement on the National Internet
Measurement Infrastructure (NIMI) [19] to gather data from a diverse
set of Internet paths. We transmitted stripes between pairs of
end-hosts and verified that their packet loss statistics were
consistent with the correlation assumptions that underlie the method.
(These stripes were different from those defined above, since all
packets in the stripes were sent to the same destination; see Section
III-A for discussion of this approach.) We also estimated the likely
accuracy that would be obtained by stripe-based inference in the
actual network.

We support this work in Section IV using network level simulation
with ns [17]. By instrumenting the simulation we can trace the
behavior of packets in the network interior. This allows us first to
study the correlation properties of packets within stripes as they
are transmitted across individual links in the network (rather than
just the end-to-end properties), and second to compare the inferred
link loss rates with actual link loss rates. For the most accurate
choice of striping method we find the typical absolute error in loss
rate inference to be below 1%. We conclude in Section V.

C. Related Work

There exist several tools and methodologies for characterizing
link-level behavior from end-to-end unicast measurements. One of the
first methodologies focuses on identifying the bottleneck bandwidth
on a unicast route. The key idea is that, in an uncongested network,
two packets (packet pair) sent back-to-back will arrive at the
receiver with a spacing that is inversely proportional to the lowest
link bandwidth on the path. This was noted by Jacobson as leading to
TCP's "self-clocking" behavior [10], and formally analyzed by Keshav
[12]. Carter and Crovella then developed a tool to apply the
technique [4], which has since been refined in [13], [18]. Although
these methodologies focus on a metric other than loss rate, they are
based on the same idea, namely to send packet pairs (or stripes) so
as to introduce correlation in a controlled manner.

In [5], the authors use end-to-end measurements of packet pairs in a
tree connecting a single sender to several receivers. Experiments
consist of a number of packet pairs where the packets are sent to
different receivers so that all pairs of receivers are covered. The
metrics of interest are success probabilities of all links in the
tree. As the second packet in a pair may not see the same loss
behavior as the first over the common path, conditional success
probabilities are introduced as unknown nuisance variables. Given an
a priori distribution for these two sets of parameters, the authors
then use a Bayesian network approach to determine a posteriori
distributions and, from these, estimates of the link transmission
probabilities. Preliminary results on the method reported in the
paper show promise. Our approach differs from the approach in [5] in
that we consider a more general form of striping scheme which results
in significantly higher correlation. Thus we are able to continue to
rely on the maximum likelihood estimates derived for the multicast
case.

Last, pathchar [6], [11] triggers ICMP messages at successive routers
on a unicast path in order to derive link bandwidth, round trip link
loss rate, and round trip link delay statistics. It accurately
estimates link bandwidth provided that the bandwidth is low. It has
not been well validated in the case of losses and delays. Moreover,
it requires considerable time to converge and loses accuracy with
asymmetric round trip paths.

II.  Inference  Methodology 
A.  Models  for  Trees,  Stripes,  and  Packet  Loss 

We first develop the framework in which to describe the propagation
of stripes of unicast packets through the network. We represent the
underlying physical network as a graph
$G_{phys} = (V_{phys}, L_{phys})$ comprising the physical nodes
$V_{phys}$ (e.g. routers and switches) and the links $L_{phys}$
between them. We consider a single source of probes $0 \in V_{phys}$
and a set of receivers $R \subset V_{phys}$. We assume that the set of
paths from 0 to each $r \in R$ is stationary and forms a tree
$T_{phys}$ in $(V_{phys}, L_{phys})$; thus two such paths never
intersect again once they have diverged. We form the logical source
tree $T = (V, L)$ whose vertices $V$ comprise 0, $R$ and the branch
points of $T_{phys}$. The link set $L$ contains the link $(j, k)$ if
one or more of the probe paths in $T_{phys}$ pass through $j$ then
$k$ without encountering another element of $V$ in between. Where
applicable, denote by $f(k) \in V$ the parent of $k \in V$. We write
$j \succ k$ if $j$ is an ancestor of $k$ in $T$.
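
The construction of the logical source tree can be sketched as follows. This is a minimal illustration, assuming the probe paths are available as explicit node lists that already form a tree; the function name `logical_tree` and the dictionary-based input format are hypothetical choices made here, not part of the original description.

```python
def logical_tree(paths):
    """Build the logical source tree T = (V, L) from physical probe paths.

    paths: dict mapping each receiver to its physical path, given as a
           list of node ids that starts at the source (assumed format).
    Returns (V, parent), where parent[k] is f(k), the parent of k in T.
    """
    # Collect the children of every physical node on the union of paths.
    children = {}
    for path in paths.values():
        for u, v in zip(path, path[1:]):
            children.setdefault(u, set()).add(v)

    source = next(iter(paths.values()))[0]
    receivers = set(paths)
    # Branch points are physical nodes where two or more paths diverge.
    branch_points = {u for u, c in children.items() if len(c) > 1}
    V = {source} | receivers | branch_points

    # A logical link joins consecutive elements of V along a path.
    parent = {}
    for path in paths.values():
        last = path[0]
        for node in path[1:]:
            if node in V:
                parent[node] = last
                last = node
    return V, parent


# Hypothetical example: nodes "c" and "d" are not branch points,
# so they are absorbed into logical links.
example_paths = {
    "r1": [0, "a", "b", "r1"],
    "r2": [0, "a", "b", "c", "r2"],
    "r3": [0, "a", "d", "r3"],
}
V, parent = logical_tree(example_paths)
```

In this example the logical tree keeps only the source, the three receivers, and the branch points `a` and `b`; each logical link stands for the composite physical path between two such nodes.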

We will use the notation $\langle r_1, \ldots, r_{d_0} \rangle$ to
refer to a stripe comprising packets dispatched to destination nodes
in order $r_1, \ldots, r_{d_0}$. We describe the progress of the
stripe in $T$ by the variables $X_k(d)$, taking the value 1 if packet
$d$ reaches node $k$, and zero otherwise. Note $X_{r_d}(d) = 1$ iff
packet $d$ reaches its destination node. (We do not label packets by
their destination since we consider stripes with repeated
destinations.)

We will find it useful to have a notation describing composite events
at sets of receivers. For $D \subseteq D_0 = \{1, \ldots, d_0\}$
define the binary variable

$$Z_D = \prod_{d \in D} X_{r_d}(d). \qquad (1)$$

Thus $Z_D = 1$ if all packets in $D$ reach their destinations, and 0
otherwise. We will find it convenient to write
$Z_{\{d_1, \ldots, d_m\}}$ as $Z_{d_1 \cdots d_m}$.

We specify a loss model for the stripes. We assume that losses are
independent between different stripes, and for packets of the same
stripe on different links. For each $k \in V$ let
$D(k) \subseteq D_0$ be the set of packets that successfully reach
(and therefore transit across) $k$. For $D \subseteq D(k)$ let
$\alpha_k(D)$ denote the probability that all packets in $D$ are
transmitted to node $k$, conditioned upon having reached the parent
node $f(k)$. We do not assume that the marginal probabilities
$\alpha_k(d)$ are equal for all $d \in D(k)$. For disjoint subsets
$D, D' \subseteq D(k)$ we write as $\beta_k(D|D')$ the conditional
probability that packets in $D$ are successfully transmitted across
link $k$, given that those in $D'$ are successfully transmitted, all
packets having reached the parent node $f(k)$. This is expressed in
terms of the probabilities $\alpha_k$ as

$$\beta_k(D|D') = \alpha_k(D \cup D') / \alpha_k(D'). \qquad (2)$$

With perfect correlations the various $\beta_k$ would be 1. The
multicast loss model of [2] is statistically equivalent to the
special case $\beta_k(D|D') = 1$ and hence $\alpha_k(d)$ all equal
some $\alpha_k$.

Fig. 1. Two-leaf tree.

For a given link and stripe width, we expect the structure of the
probabilities $\alpha$, $\beta$ to depend on the times between
successive packets. For example, if the packets are widely separated,
then the marginal probabilities $\alpha_k(d)$ will be equal (or
nearly so) while the conditional probabilities $\beta$ will be close
to the marginal probabilities $\alpha$. Here, we concentrate on the
other extreme with back-to-back packets in order to make $\beta$
close to 1. In this paper we focus on estimating transmission
probabilities for the first packet in a stripe. We note however that
marginal transmission probabilities can depend on the position of a
packet within a stripe, particularly when the stripe width is not
negligible compared with buffer sizes. However, our methods can be
adapted to focus on other packets within the stripe. This could be
useful if it is desired to infer transmission probabilities for
packets in traffic bursts.
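
As a purely illustrative check of the two extremes just described, consider a binary stripe on a link $k$ with marginals $\alpha_k(1) = \alpha_k(2) = 0.9$. These numbers, and the joint probabilities below, are assumptions chosen for the example rather than measured values. Applying relation (2):

```latex
\begin{align*}
\text{perfect correlation:} \quad
  & \alpha_k(\{1,2\}) = 0.9
  && \Rightarrow \quad \beta_k(1|2) = 0.9 / 0.9 = 1, \\
\text{widely separated (independent):} \quad
  & \alpha_k(\{1,2\}) = 0.9 \times 0.9 = 0.81
  && \Rightarrow \quad \beta_k(1|2) = 0.81 / 0.9 = 0.9 = \alpha_k(1), \\
\text{intermediate case:} \quad
  & \alpha_k(\{1,2\}) = 0.88
  && \Rightarrow \quad \beta_k(1|2) = 0.88 / 0.9 \approx 0.978 .
\end{align*}
```

The intermediate case is the regime targeted by back-to-back transmission: $\beta$ lies strictly between the marginal $\alpha$ and 1.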


B.  Inference  with  Binary  Stripes  on  the  Two-Leaf  Tree 

We first investigate the performance of the inference algorithms from
[2] under imperfect correlations. We start with the two-leaf tree
shown in Figure 1, having leaf nodes L and R with common parent C
whose own parent is the root 0. Consider the binary stripe
$\langle L, R \rangle$. The link probabilities are related to the
probabilities of leaf events as follows:

$$\frac{EZ_1\,EZ_2}{EZ_{12}} = \frac{\alpha_C(1)}{\beta_C(1|2)}, \qquad
\frac{EZ_{12}}{EZ_2} = \alpha_L(1)\,\beta_C(1|2), \qquad
\frac{EZ_{12}}{EZ_1} = \alpha_R(2)\,\beta_C(2|1), \qquad (3)$$

where $Z_D$ is as defined in (1). This is because, e.g.,
$EZ_{12} = \alpha_C(12)\,\alpha_L(1)\,\alpha_R(2)
= \alpha_C(1)\,\beta_C(2|1)\,\alpha_L(1)\,\alpha_R(2)$,
with similar expressions for $EZ_1$ and $EZ_2$. With perfect
correlations, $\beta_C = 1$, and hence the $\alpha$ are uniform
across the stripe and may be recovered directly from the leaf
probabilities. These expressions can then be used to estimate the
$\alpha$ from the leaf events associated with multiple identical
stripes $i = 1, 2, \ldots, n$. To form the estimates we first replace
each expectation in (3) by the corresponding empirical mean, defined
here in general:

$$\bar{Z}_D = n^{-1} \sum_{i=1}^{n} Z_D^{(i)}. \qquad (4)$$

Taking $\beta_C = 1$ then yields the estimates

$$\hat{\alpha}_C = \bar{Z}_1 \bar{Z}_2 / \bar{Z}_{12}, \qquad
\hat{\alpha}_L = \bar{Z}_{12} / \bar{Z}_2, \qquad
\hat{\alpha}_R = \bar{Z}_{12} / \bar{Z}_1. \qquad (5)$$

This is effectively the estimator from [2] applied to the two-leaf
tree.
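
The estimator (5) is simple enough to sketch directly. The function name and the input format (one pair of 0/1 leaf indicators per stripe) are assumptions made for illustration; the arithmetic follows (4) and (5).

```python
def binary_stripe_estimates(outcomes):
    """Estimate link transmission probabilities on the two-leaf tree.

    outcomes: list of (z1, z2) pairs, one per stripe <L, R>, where
              z1 = 1 if packet 1 reached L and z2 = 1 if packet 2
              reached R (assumed input format).
    Returns the estimates of (5), which take beta_C = 1.
    """
    n = len(outcomes)
    # Empirical means of Z_1, Z_2 and Z_12, as in (4).
    z1 = sum(a for a, _ in outcomes) / n
    z2 = sum(b for _, b in outcomes) / n
    z12 = sum(a * b for a, b in outcomes) / n
    if z12 == 0:
        raise ValueError("no stripe had both packets delivered")
    return {
        "alpha_C": z1 * z2 / z12,
        "alpha_L": z12 / z2,
        "alpha_R": z12 / z1,
    }


# Hypothetical data: 10 stripes with made-up delivery patterns.
demo = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] + [(0, 0)]
estimates = binary_stripe_estimates(demo)
```

With this made-up data the empirical means are 0.8, 0.7 and 0.6, so the common-link estimate is 0.8 x 0.7 / 0.6. The estimate is only as good as the assumption $\beta_C = 1$, which is the issue taken up next in the text.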

With imperfect correlations, $\beta_C$ cannot be recovered
independently from the leaf expectations. The model is not
identifiable; this was also observed in [5]. Since $\beta_C < 1$,
estimation via (5) is biased, overestimating $\alpha_C$ and
underestimating $\alpha_L$ and $\alpha_R$.


C.  Enhancing  Stripe  Correlations 

The uncertainty over the values of the $\beta$ undermines confidence
in using (5) directly. We now propose a modified striping scheme for
which the effective value of the $\beta$ is closer to 1. To glimpse
the idea behind this, observe that for the stripe
$\langle L, R \rangle$ with perfect correlations, $EZ_{12}/EZ_2$
(defined as the conditional probability for the first packet of the
stripe to reach L given that its second packet reaches R) is actually
equal to the probability of transmission of a packet along the link
(C, L), conditional upon reaching C. This is because packet 2 must
have been present at C if present at R. With imperfect correlations,
packet 1 may not have been also present at C, leading to
underestimation of $\alpha_L$. Our remedy for this is to use longer
stripes, conditioning on an event at R which makes it more likely
that packet 1 was present at C.

The simplest example of such a stripe is the three-packet stripe
$\langle L, R, R \rangle$. Provided that transmission of packets
within the stripe is strongly correlated (in a sense we make precise
below) we expect it to be more likely that packet 1 reaches C upon
reception of packets 2 and 3 at receiver R, rather than reception of
packet 2 alone. We formalize the required notion of correlation in
Definition 1 below.

Upon replacing the reception of packet 2 with the reception of
packets 2 and 3, the analogs of the first and second relations in (3)
are

$$\frac{EZ_1\,EZ_{23}}{EZ_{123}} = \frac{\alpha_C(1)}{\beta_C(1|23)}, \qquad
\frac{EZ_{123}}{EZ_{23}} = \alpha_L(1)\,\beta_C(1|23). \qquad (6)$$

The parameters $\alpha_C$ and $\alpha_L$ are estimated by
$\bar{Z}_1 \bar{Z}_{23} / \bar{Z}_{123}$ and
$\bar{Z}_{123} / \bar{Z}_{23}$ respectively; $\alpha_R$ can be
estimated similarly using the complementary stripe
$\langle R, L, L \rangle$. Comparing with (5) we observe that these
estimates introduce less bias than those from two-packet stripes
provided that $\beta_C(1|23) \ge \beta_C(1|2)$. This is the case
provided that transmissions within a stripe satisfy the following
correlation property.

Definition 1: We say that stripe transmission at a node $k$ is
coalescent if for each stripe $\langle r_1, \ldots, r_{d_0} \rangle$
routed through $k$, and disjoint $D, D' \subseteq D(k)$,

$$\beta_k(D|D') \ge \beta_k(D|D'') \quad \text{for all } D'' \subseteq D'. \qquad (7)$$

Coalescence is a correlation property. It states that a given set of
packets $D$ is more likely to be transmitted on a link, the more
other packets from the stripe have been transmitted. We will
investigate the coalescence properties of real network traffic in
Section III.

With coalescence, whenever we add packets to the conditioning event,
the effect is to decrease the estimate of $\alpha_C$ and to increase
the estimate of $\alpha_L$ or $\alpha_R$. Thus, we can counteract the
bias in the two-leaf stripe, evident from (3), by using wider
stripes.
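
To make the role of the conditioning event concrete, the following sketch computes the empirical estimates for a stripe whose first packet goes to L and whose remaining packets go to R, with a selectable set of conditioning packets. The function name, argument names and data are hypothetical; packet positions are 0-based in the code.

```python
def stripe_estimates(outcomes, first=0, conditioning=(1,)):
    """Estimate alpha_C and alpha_L from stripes <L, R, ..., R>.

    outcomes:     list of tuples of 0/1 delivery indicators, one tuple
                  per stripe, one entry per packet (assumed format).
    first:        position of the packet sent to L.
    conditioning: positions of packets sent to R that form the
                  conditioning event.
    Returns (alpha_C_hat, alpha_L_hat).
    """
    n = len(outcomes)

    def mean_all(positions):
        # Empirical mean of Z_D: fraction of stripes in which every
        # packet in `positions` reached its destination.
        return sum(all(o[i] for i in positions) for o in outcomes) / n

    cond = tuple(conditioning)
    z_first = mean_all((first,))
    z_cond = mean_all(cond)
    z_joint = mean_all((first,) + cond)
    return z_first * z_cond / z_joint, z_joint / z_cond


# Hypothetical outcomes for 20 three-packet stripes <L, R, R>.
demo = ([(1, 1, 1)] * 12 + [(0, 1, 1)] + [(1, 1, 0)] + [(0, 1, 0)]
        + [(1, 0, 0)] * 2 + [(0, 0, 0)] * 3)
two_packet = stripe_estimates(demo, first=0, conditioning=(1,))
three_packet = stripe_estimates(demo, first=0, conditioning=(1, 2))
```

For this made-up data, conditioning on packets 2 and 3 instead of packet 2 alone lowers the estimate of $\alpha_C$ and raises the estimate of $\alpha_L$, which is the direction described above for coalescent transmission. Real data need not behave this way if coalescence fails.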

Theorem 1: Assume transmission is coalescent on the two-leaf tree and
consider a stripe $\langle D(C) \rangle$ and two disjoint subsets
$D, D'$ of $D(C)$ such that packets in $D$ have destination L and
packets in $D'$ have destination R. Then for any $D'' \subseteq D'$,

$$\frac{EZ_{D \cup D'}}{EZ_{D'}} \ge \frac{EZ_{D \cup D''}}{EZ_{D''}}. \qquad (8)$$

The inequality (8) captures the effect that extending the stripe
reduces the estimate of the transmission rate $\alpha_C$ and so
counteracts the bias due to $\beta_C < 1$.

Proof: $EZ_{D \cup D'} = \beta_C(D|D')\,\alpha_C(D')\,\alpha_L(D)\,\alpha_R(D')$
while $EZ_{D'} = \alpha_C(D')\,\alpha_R(D')$. Hence
$EZ_{D \cup D'} / EZ_{D'} = \beta_C(D|D')\,\alpha_L(D)
\ge \beta_C(D|D'')\,\alpha_L(D) = EZ_{D \cup D''} / EZ_{D''}$. ■

Example: the 4-packet stripe. Theorem 1 suggests we can further
reduce bias by lengthening the stripe. Consider, for instance, the
stripe $\langle L, R, R, R \rangle$ and compare its estimation
properties with those of its substripes $\langle L, R, R \rangle$ and
$\langle L, R \rangle$. By Theorem 1 we have the following ordering
between the functionals on which estimates of $\alpha_C$ are based in
each case:

$$\frac{EZ_1\,EZ_{234}}{EZ_{1234}} \le \frac{EZ_1\,EZ_{23}}{EZ_{123}} \le \frac{EZ_1\,EZ_2}{EZ_{12}}.$$

The estimators are obtained by replacing each $EZ$ by the
corresponding empirical mean $\bar{Z}$ from $n$ stripes. By the Law
of Large Numbers, the same inequalities hold for the estimates with
probability 1 as $n$ grows to infinity.
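
The ordering can be checked numerically under a toy loss model. The model below is an assumption introduced only for this illustration and is not taken from the paper: on the common link, with probability `g` the whole stripe passes, and otherwise each packet passes independently with probability `q`. This gives $\alpha_C(D) = g + (1-g)\,q^{|D|}$, which can be verified to be coalescent. The leaf-link terms cancel in each functional, so only the common link matters.

```python
def alpha_c(n, g=0.9, q=0.5):
    """Toy model (assumed): probability that n packets of a stripe all
    cross the common link. With probability g the link is uncongested
    and every packet passes; otherwise each packet passes
    independently with probability q."""
    return g + (1.0 - g) * q ** n


def functional(m, g=0.9, q=0.5):
    """Value of EZ_1 * EZ_{2..m} / EZ_{1..m} for the m-packet stripe
    <L, R, ..., R>. Leaf-link probabilities cancel, leaving
    alpha_C(1) * alpha_C(m - 1) / alpha_C(m)."""
    return alpha_c(1, g, q) * alpha_c(m - 1, g, q) / alpha_c(m, g, q)


# Functionals for the 2-, 3- and 4-packet stripes, and the true value.
values = {m: functional(m) for m in (2, 3, 4)}
true_alpha_c = alpha_c(1)
```

With the default parameters the true marginal is 0.95 and the functionals are roughly 0.976, 0.963 and 0.957 for stripes of 2, 3 and 4 packets: each added conditioning packet moves the value closer to the true $\alpha_C(1)$ from above, in line with the ordering stated in the example.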

D. Extension to General Trees

We describe estimators that extend the foregoing method to
treat general logical source trees, i.e., trees in which the
depth and branching ratio can be greater than 2. Consider first
the case of a depth 2 tree with an arbitrary number of leaves.
One approach is to stripe across all receivers and then to adapt
the estimator from [2] for nodes with arbitrary numbers of
offspring in order to estimate the link probabilities. A
potential problem with this approach is that the statistical
properties of stripes may not reflect those of general traffic
if their width is not negligible compared with buffer sizes. For
the same reason, variation of stripe width within a single set
of measurements may introduce non-uniform bias into the link
probability estimates, depending on the local branching ratio.
Instead, here we focus on combining inference from fixed-width
stripe measurements on embedded subtrees.

Consider an arbitrary tree with leaf set $R$. For each node $k$
let $R(k)$ denote the subset of leaves descended from $k$. Let
$Q(k)$ denote the set of ordered pairs of nodes in $R(k)$
descended through different children of $k$. For each
$(R_1, R_2) \in Q(k)$, consider the embedded two-leaf binary
tree spanned by the nodes $0, k, R_1, R_2$. By combining
estimates from measurements of stripes down each such tree, we
shall estimate the characteristics of the common path from 0
to $k$.

Each stripe will follow the same pattern. We fix a template for
a stripe of $d_0$ packets by partitioning $\{1, \ldots, d_0\}$
into two sets $D_1, D_2$. For each ordered pair
$(R_{i_1}, R_{i_2})$ of distinct receivers in $R(k)$ we form a
stripe that sends packets in positions in $D_1$ to $R_{i_1}$ and
packets in positions in $D_2$ to $R_{i_2}$. More formally, this
is the stripe $S(i_1, i_2) = (r_1, \ldots, r_{d_0})$ where
$r_d = R_{i_k}$ when $d \in D_k$.

The relation between the leaf probabilities and the transmission
probabilities on the composite path from 0 to $k$ is expressed
through

\[ \frac{E Z_{D_1} \, E Z_{D_2}}{E Z_{D_1 \cup D_2}} = \frac{A_k(D_1)}{B_k(D_1 | D_2)}, \qquad (10) \]

where $A_k = \prod_{j \succeq k} \alpha_j$, the product being taken
over the links on the path from 0 to $k$, and $B_k$ is the
corresponding conditional transmission probability on that path.
For each non-leaf and non-root node $k$ and each pair
$(i, j) \in Q(k)$, the measurements with n stripes of type
$S(i, j)$ thus give rise to an estimate

\[ \hat{A}_k^{i,j} = \frac{\bar{Z}_{D_1} \, \bar{Z}_{D_2}}{\bar{Z}_{D_1 \cup D_2}}. \qquad (11) \]

In the experiments described in this paper we combine all
possible estimates through their arithmetic mean

\[ \hat{A}_k = \#Q(k)^{-1} \sum_{(i,j) \in Q(k)} \hat{A}_k^{i,j}. \qquad (12) \]

For leaf nodes $k$ take $\hat{A}_k$ as the measured transmission
probability over all stripes of packets to $k$, and set
$\hat{A}_0 = 1$ by convention. The link probability estimates
are then expressed as quotients

\[ \hat{\alpha}_k = \hat{A}_k / \hat{A}_{f(k)}, \qquad k \ne 0, \qquad (13) \]

where $f(k)$ denotes the parent of node $k$.
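
The averaging step (12) and the quotient step (13) can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the tree is given as a hypothetical `parent` dictionary, the pairwise path estimates for each interior node are assumed to have been computed already, and all names and numbers are made up for the example.

```python
from statistics import mean


def link_estimates(parent, pair_estimates, leaf_estimates, root=0):
    """Combine pairwise path estimates into per-link estimates.

    parent         -- dict mapping each non-root node to its parent
    pair_estimates -- dict: interior node k -> list of pairwise estimates
                      of the path probability from the root to k
    leaf_estimates -- dict: leaf k -> measured transmission probability
                      over all stripe packets sent to k
    """
    path_estimate = {root: 1.0}              # path estimate at the root is 1
    for k, values in pair_estimates.items():
        path_estimate[k] = mean(values)      # arithmetic mean over pairs
    path_estimate.update(leaf_estimates)
    # Each link estimate is the node's path estimate divided by its parent's.
    return {
        k: path_estimate[k] / path_estimate[parent[k]]
        for k in path_estimate
        if k != root
    }


if __name__ == "__main__":
    # Hypothetical depth-2 tree: root 0, interior node 1, leaves 2 and 3.
    parent = {1: 0, 2: 1, 3: 1}
    pairs = {1: [0.90, 0.94]}                # estimates from pairs (2,3), (3,2)
    leaves = {2: 0.828, 3: 0.736}
    print(link_estimates(parent, pairs, leaves))
```

Keeping the path estimates in one dictionary makes the quotient step a single lookup per node, and the same function applies unchanged to deeper trees as long as every interior node has at least one pairwise estimate.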

E.  Sampling  and  Statistical  Issues 

Earlier in this section we proposed using wider stripes
as a way of counteracting the inherent bias in using
estimators that do not take explicit account of the imperfect
correlations between stripe packets. We now make a number
of further observations on the statistical implications of
using the stripe approach.

First, increasing the stripe width while keeping the total
number of packets sent constant increases the variance of
the estimates. This is because the number of stripes sent is
in inverse proportion to their width.

Second, network characteristics may not be uniform
across a stripe, e.g., if stripe width is comparable in size to
that of a buffer. Here we focused on estimating transmission
probabilities for the first packet; other templates could
direct attention to other positions. We note that if marginal
transmission rates are highly heterogeneous across different
positions in a stripe, then the assumption of independent
packet loss on different links may not hold. This is because
the expected loss rate of a packet at a given node can
depend on the occurrence, closer to the source, of losses of
packets in earlier stripe positions. Such losses cause the
packet to advance its position in the stripe and consequently
experience a different loss rate.

Third, there is a phenomenon during TCP slow start
that can lead to every other or every third packet being
lost. Once TCP increases its window enough to “fill the
pipe,” which corresponds to transmitting at the bottleneck
rate, then the next set of acknowledgements effectively
increases the sending rate by either a factor of two (if the
receiver acknowledges every incoming packet) or a factor of
1.5 (if the receiver uses the common “ack every other”
policy). If the bottleneck buffer is full at this point, then
either every other or every third packet will be lost at the
bottleneck due to the mismatch between the bottleneck rate and
the higher sending rate. See Figure 2 of [8] for an
illustration. Accordingly, there may be buffer-filling patterns
present in the network that impart particular loss patterns
on the elements of a stripe. The prevalence of the “slow
start” pattern will depend on how often TCP connections
in slow start dominate the consumption of buffer space at
the bottleneck link.

Fourth, we have observed that imperfect correlations at
a node bias inference for parent and child links in opposite
directions. Hence bias is a second order effect spatially,
depending not on the absolute loss correlation, but rather
on the manner in which it changes from node to node in
the network. In the special case of the probabilities $\alpha$, $\beta$
being uniform over all links, imperfect correlations actually
leave the estimates (5) unbiased for internal links (i.e.,
all those except the leaf links and root link), though this
special case seems highly unlikely in practice.

Fifth, the analysis of estimator variance for multicast
inference carries over when $\beta \approx 1$. We refer the reader to [2]
for details. Here we mention that in a regime for which all
loss rates $\bar{\alpha}_k = 1 - \alpha_k$ are close to zero, the estimator
$\hat{\alpha}_k$ has variance which behaves as
$n^{-1}(\bar{\alpha}_k + O(\|\bar{\alpha}\|^2))$ asymptotically for large
numbers n of probes. To leading order, this form is independent
of topology.

III. Network Experiments

The estimation techniques described in Section II rely
on conditional probabilities of packet transmission within
stripes being close to 1, and on the coalescence property
in order to counteract the bias due to shortcomings with
this assumption. In this section we investigate conformance
of both of these assumptions to measurements of
stripes transmitted across a number of end-to-end paths
in the Internet. Although these experiments did not access
the transmission properties of individual links (logistically
very difficult to measure), they would be able to
detect link-wise departures from the assumptions, since
these would also be reflected in the properties of end-to-end
paths over non-conformant links.

A.  Measurement  Infrastructure 

We conducted the experiments using the National Internet
Measurement Infrastructure (NIMI) [19]. NIMI
consists of a number of measurement platforms deployed
across the Internet (primarily in the U.S.) that can be
used to perform end-to-end measurements. We made the
measurements using the zing utility, which sends UDP
packets in selectable patterns, recording the time of
transmission and reception. We extended zing to transmit
unicast stripes to multiple destinations, minimizing the
spacing between packets in a stripe by precomputing the
packets to send (including their MD5 integrity checksum,
the most computationally expensive part of constructing
a zing packet) and then transmitting them with back-to-back
system calls, resulting in inter-packet spacings of
about 40 μsec.

A key point is that for our measurements we did not
actually send packets to multiple destinations, because we
had no way of calibrating true inference of internal loss
characteristics, which would require measurement inside
the network. Instead, the results we report are all for
stripes sent to the same destination, with the goal being
to assess the conditional loss probability and coalescence
properties.

Fig. 2. Scatter plot of transmission probabilities in 28
network experiments. Conditional vs. marginal end-to-end
transmission probabilities. Probabilities for 3-packet stripes
mostly meet or exceed those for 2-packet stripes.

We gathered a total of 63 successful measurements between
35 NIMI sites, each measurement recording at both
sender and receiver the transmission of either 100,000
flights of stripes of 3 packets, with separations exponentially
distributed with a mean of 100 msec; 10,000 flights
of stripes of 10 packets, separated by a mean of 300 msec
(we also analyzed the first 3 packets in each stripe as
another dataset of 3-packet stripes); or 20,000 flights of
stripes of 3 packets, separated by a mean of 500 msec.
All measurements were made at either 2PM EDT (a busy
time) or 2AM EDT (a fairly unloaded time). There was no
noticeable change in behavior as we varied the inter-stripe
spacing from 100 msec to 500 msec.

Of the 63 traces, 7 exhibited no loss whatsoever, and
consequently we had to eliminate them as they could not
be used to study loss inference. Of the remaining 56, fully
half (28) had conditional loss probabilities of 1, reflecting
perfect loss correlation just as we would have if using
multicast traffic instead of unicast. This finding is highly
encouraging for the efficacy of unicast loss inference.

In the remainder of this section, we analyze the properties
of the 28 traces that did not exhibit perfect correlation.


B(1|2,...,w)/B(1|2,...,w-1)   w = 2    w = 3    w = 4    w = 5    w = 6
min.                          1.0000   1.0000   1.0000   1.0000   1.0000
mean                          1.0189   1.0002   1.0000   1.0001   1.0001
max.                          1.1812   1.0021   1.0003   1.0005   1.0003

TABLE I

Coalescence of transmission in network experiments. Ratios of
end-to-end conditional transmission probabilities in stripes of
width 2 to 6. Minimum, mean and maximum of ratios observed in
19 traces of stripes of width 10. Minimum ratio 1 conforms with
the coalescence property.


B.  Transmission  Probabilities 

Marginal Probabilities. The packet loss rate varied between
zero and about 14% over the experiments. The
marginal packet loss rates for different positions in the
stripe displayed some heterogeneity. The heterogeneity
was most pronounced at the start of the stripe, with the loss
rate for the second packet in a stripe being typically 1.19
times greater than that of the first. Moving further along
the stripe, loss rates differed between successive positions
typically by a factor of up to 1.03.

Conditional Probabilities. We can estimate the error involved
in the stripe method by comparing conditional and
marginal transmission probabilities within the stripe. A
scatter plot of the conditional vs. marginal probabilities
for 2 and 3 packet stripes in 28 experiments is shown in
Figure 2. Higher points represent smaller relative error;
conversely for points near the line the error is of the same
order of magnitude as the marginal probability to be estimated.
For both 2 and 3 packet stripes, the end-to-end conditional
transmission probabilities B are noticeably larger
than the marginal transmission probabilities A, with those
for the 3 packet stripe being at least as large as those for the
2 packet stripes in almost all cases. A conditional probability
of 1 would signify perfect correlations. We can characterize
the error arising from B < 1 through the ratio
$(1 - B)/(1 - A)$ when $A \ne 1$. This represents the proportion
of the reported loss rate which is typically in error due
to imperfect correlations. For 2-packet stripes, the median
value of this ratio was 0.12. (So, for example, an estimated
loss rate of 1% would be in error by about 0.12%.)
The median ratio fell to 0.09 for 3 packet stripes.
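
The error ratio can be computed from stripe traces along the following lines. This is a sketch only: it assumes each stripe is stored as a tuple of 0/1 receipt indicators, and the function names and sample data are hypothetical.

```python
def transmission_probabilities(stripes, width):
    """Return (A, B) for the first packet of each stripe.

    A is the marginal transmission probability of packet 1.
    B is its transmission probability given that packets 2..width
    of the same stripe all arrived.
    """
    a = sum(1 for s in stripes if s[0]) / len(stripes)
    conditioning = [s for s in stripes if all(s[1:width])]
    if not conditioning:
        raise ValueError("conditioning event never occurred")
    b = sum(1 for s in conditioning if s[0]) / len(conditioning)
    return a, b


def error_ratio(a, b):
    """Ratio (1 - B) / (1 - A): the share of the reported loss rate
    attributable to imperfect correlation. Undefined when A = 1."""
    if a == 1:
        raise ValueError("ratio undefined when A = 1")
    return (1 - b) / (1 - a)


if __name__ == "__main__":
    # Hypothetical 2-packet stripes.
    stripes = [(1, 1)] * 7 + [(0, 1)] * 2 + [(0, 0)]
    a, b = transmission_probabilities(stripes, 2)
    print(a, b, error_ratio(a, b))
```

A ratio of 0.12, the median reported for 2-packet stripes, corresponds for instance to A = 0.9 and B = 0.988.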
Coalescence. We calculated end-to-end conditional transmission
probabilities B(1|2, 3, ..., w) for stripes of width
w between 1 and 6. (When w = 1 this just denotes the
marginal probability A(1).) A necessary condition for
coalescence is that the ratios B(1|2,...,w)/B(1|2,...,w-1)
be ≥ 1. We determined the ratios over 19 experiments
with stripes of width 10. In only two instances were the
ratios less than 1, and in these cases by a magnitude of
only about $10^{-6}$. This is a far smaller magnitude than that
by which the ratio typically exceeds 1, as is seen from the
statistics displayed in Table I: the minimum, mean, and
maximum for each w over the 19 experiments. The ratios
are largest for w = 2, falling off close to 1 as w increases
beyond 3. This suggests that the additional bias correction
obtained by increasing stripe width is almost negligible for
stripes wider than 3 packets, at least under the network
conditions and the range of loss probabilities exhibited in
these traces.
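
The necessary condition for coalescence can be checked on a trace with a short script such as the one below. It is a sketch under assumptions: stripes are tuples of 0/1 receipt indicators, every conditional probability is nonzero, and the helper names and the ten sample stripes are hypothetical.

```python
def conditional_transmission(stripes, w):
    """B(1 | 2, ..., w): fraction of stripes delivering packet 1 among
    those that delivered packets 2..w. For w = 1 the conditioning set
    is every stripe, so this is the marginal probability A(1)."""
    conditioning = [s for s in stripes if all(s[1:w])]
    if not conditioning:
        raise ValueError("conditioning event never occurred")
    return sum(1 for s in conditioning if s[0]) / len(conditioning)


def coalescence_ratios(stripes, max_width):
    """Ratios B(1|2..w) / B(1|2..w-1) for w = 2, ..., max_width.
    Assumes each conditional probability is nonzero."""
    probs = [conditional_transmission(stripes, w)
             for w in range(1, max_width + 1)]
    return [probs[i] / probs[i - 1] for i in range(1, len(probs))]


if __name__ == "__main__":
    # Hypothetical 4-packet stripes.
    stripes = (
        [(1, 1, 1, 1)] * 6
        + [(1, 1, 0, 0), (0, 1, 1, 1), (0, 1, 0, 0), (0, 0, 0, 0)]
    )
    ratios = coalescence_ratios(stripes, 4)
    print(ratios, all(r >= 1 for r in ratios))
```

In practice a small tolerance is advisable when testing the ratios against 1, since sampling noise can push a ratio marginally below 1, as happened in two instances in the traces described above.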

C.  Interpretation 

The network experiments are encouraging for unicast-based
inference. First, in half of the traces the stripes
exhibited perfect correlations. If this property were
reproduced in stripes to multiple destinations, their statistical
properties would be identical to those of multicast traffic for
the purposes of link loss inference. Second, in traces with
imperfect correlations, the conditional transmission
probabilities within the stripe were considerably higher than the
marginal probabilities, slightly more for the 3 packet stripe
than the 2 packet stripe. This indicates that the bias due
to ignoring the imperfection in correlations is relatively
small. Third, traces exhibited coalescence for the stripe
widths considered, indicating that the bias can be
compensated for by using wider stripes, although the incremental
benefit grew smaller for larger stripe widths. These factors
lead us to expect that striped unicast probing will be quite
effective for loss inference under real network conditions.


Fig. 4. Conditional Transmission Probabilities in
Simulations. Scatter plot of conditional vs. marginal link
transmission probabilities for 2, 3 and 4 packet stripes.
Conditional probabilities increase with stripe width.

IV.  Simulation  Results 
A.  Methodology 

The experiments of Section III give us confidence that
the statistical properties of stripe transmission make stripes
suitable as probes for inference. However, the experiments
do not enable us to corroborate the accuracy of the
estimators for real network traffic. Instead, we employ simulation
to get a sense of how accurate the estimators might be in
practice.

We used the ns simulation environment [17]; this
enables the representation of transport-protocol detail of
packet transmissions, with packet loss due to buffer
overflows at nodes as stripes compete with background traffic.
The simulations reported in this paper used the topology
of Figure 3. The different link speeds and delays are
intended to characterize low speed/low delay links at a
network edge connected by high speed/high delay links in the
network interior. The goal is to study the methodology in
a simplified environment to look for major problems, not
to make a definitive assessment of the methodology.

Background traffic comprised a mixture of sessions over
TCP and exponential on-off sources. There were on average
11 sessions per link direction. The buffer on each link
accommodated 20 packets. Measurement probes comprised
stripes with a 1 μsec interpacket time. Stripes were
generated periodically with an inter-stripe time of 16 msec. The
tree was covered by cycling through thirty stripes S(i, j)
over pairs of distinct receivers i, j. During an experiment,
each stripe was transmitted 1,000 times. We conducted a
set of 100 experiments using 4 packet stripes. To compare
the estimator performance under the different stripe
lengths we considered the 2 and 3-packet substripes
obtained using the first two and three packets in each stripe.
In order to evaluate the method, the inferred loss rates were
compared with internal link loss rates as determined by
instrumentation of the simulation. Link loss rates were
computed considering only the first probe in the stripe.

Fig. 5. Inferred vs. actual link loss rates in simulations. 3 packet
and 2 packet substripes. Scatter plot for 100 experiments.

Fig. 6. Inferred vs. actual link loss rates in simulations. 4
packet stripes and 3 packet substripes. Scatter plot for 100
experiments. [Plot omitted; horizontal axis: actual loss
probability, 0 to 0.25.]

B.  Conditional  and  Marginal  Transmission  Probabilities 

We first examine the statistical properties of the underlying
link loss processes. Figure 4 is a scatter plot of
conditional vs. marginal transmission probabilities for 2, 3
and 4 packet stripes. Observe that conditional probabilities
increase with stripe width. We summarize the likely
relative errors in each case through the statistics of the ratio
$(1 - \beta)/(1 - \alpha)$ of conditional to marginal loss
probabilities. For 2 packet stripes the median ratio was 0.32 (i.e.,
a relative error of 32%). The ratio fell to 0.20 for 3 packet
stripes, and further to 0.12 for 4-packet stripes.

These errors are somewhat greater than those observed
for end-to-end transmission in the network experiments.
We believe this may be associated with a greater
heterogeneity in marginal transmission rates that we observed
in the simulations; loss rates grew by about 30% between
successive positions for the first 4 packets of a stripe.
Recall from Section III-B that in the network experiments, the
largest such ratio was 19%, and typical ratios were 3%.
The stronger growth in loss ratios along the stripe in the
simulations may be due to the larger size of the stripe
relative to buffer size (20 packets) as compared with that in
real networks.

stripe width     2        3        4
mean             0.0099   0.0075   0.0063
s.d.             0.0064   0.0057   0.0052

TABLE II

Estimation error in simulations as function of stripe width.
Mean and standard deviation of absolute difference between
inferred and actual loss rates. Errors are minimized for
4-packet stripes.

C.  Accuracy  of  Inference 

Finally, we compare inferred and actual link loss rates
in the simulations. We display scatter plots of inferred vs.
actual loss for 2 and 3 packet stripes in Figure 5, and 3 and
4 packet stripes in Figure 6. The same number of stripes
was used in each case. From the figures we observe that
accuracy increases with wider packet stripes as exhibited
by the clustering about the line y = x. In Table II we
summarize the statistics of the absolute error, i.e., the modulus
of the difference between the inferred and actual link loss
rates. This is just under 1% in the worst case, i.e., for the 2
packet stripe, and 0.63% in the best case, i.e., the 4 packet
stripe. Thus, by exploiting the coalescence property, we
have achieved a reduction of nearly 40% in absolute error,
simply by increasing the stripe length from two to four.

V.  Conclusions  and  Further  Work 

In this paper we have proposed a method of using end-to-end
unicast probing to infer the loss characteristics of
the network interior. The method relies on using collections
of unicast probes, called stripes, dispatched back-to-back
to different destinations, in order to mimic the effect
of a notional multicast packet that followed the same path.
We infer internal loss rates by applying an estimator
developed for multicast inference to the unicast receiver traces.
This estimator is unbiased when the transmissions of a
stripe’s probes on a given link are perfectly correlated.
Imperfect correlations lead to bias, but we prove that this can
be compensated for by using wider stripes, provided that
the stripe transmissions obey a certain correlation property
that we call coalescence. This is the property that successful
transmission of a given packet in the stripe becomes
more likely when more other packets from the stripe have
been successfully transmitted.

Our network experiments show that for end-to-end
transmission, correlations within stripes are very high,
even perfect in some cases. Moreover, the coalescence
property was found to hold in virtually all cases examined.
Together these properties lead us to expect that inference
from striped unicast probes will be effective in estimating
link loss rates.

Our next step in network experimentation is to directly
assess the method by performing corroborative measurements
in the network interior. This entails taking measurements
on paths over which probe traffic flows, then comparing
actual loss rates with inferred loss rates on internal
paths.

Currently, such corroboration is available to us only in
simulation experiments. The ns simulations showed good
agreement between inferred and actual loss rates; the typical
error in these experiments was about 1% for the 2-packet
stripe, falling to 0.63% when the stripe width was
increased to 4.

Our next step in simulation will be to investigate the
magnitude of these effects for systems with larger buffers
and more diverse background traffic, which are more
representative of actual networks.

In this paper we have concentrated on estimation of link
probabilities for the first packet of a stripe. However, due
to heterogeneity of loss along the stripe, such estimates
may not be representative of typical packets, e.g., packets
contained within a burst. Clearly, the present method
could be extended, through use of other stripe templates,
to estimate link probabilities for packets in positions other
than the first. In the future we hope to increase the accuracy
of inference by tuning the stripe properties to the burst
structure observed in background traffic.

Finally, we remark that a number of other multicast-based
estimators (namely those for delay distributions
[15], for delay variances [7], and for logical multicast
topology [3]) have the potential to be adapted in the same
manner as was done for loss estimators in this paper. We feel
that our promising results on unicast-based loss estimation
warrant extending the estimator to these other settings.

Abstract

There has been considerable activity recently to develop monitoring and debugging tools for a
multicast session (tree). With these tools in mind, we focus on the problem of how to lay out multicast
sessions so as to cover a set of links of interest within a network. We define three variations of this
layout (cover) problem that differ in what it means for a link to be covered. We then focus on the
identifiability problem, to determine whether a given set of candidate multicast trees can cover the set of links
of interest; and the minimum cost problem, to determine the minimum cost set of trees that cover the
links in question. We establish efficient algorithms to solve the identifiability problem and show that,
with few exceptions, the minimum cost problems are NP-hard and that even finding an approximation
within a certain factor is NP-hard. One exception is when the underlying network topology is a tree. For
this case, we demonstrate an efficient algorithm that finds the optimal solution. We also present several
computationally efficient heuristics and their evaluation through simulation. We find that two heuristics,
a greedy heuristic that combines sets of trees with three or fewer receivers, and a heuristic based on
generalizing our tree algorithm, both perform reasonably well. The remainder of the paper applies our
techniques to the vBNS and Abilene networks, examining the effectiveness of the different heuristics
and the sensitivity of the costs to the choice of routing algorithm.


1  Introduction 

Multicast is a technology that shows great promise for providing the efficient delivery of content from
a single source to many receivers. An interoperable networking infrastructure is nearly in place
(PIM-SM/MSDP/MBGP, SSM) and the development of mechanisms for congestion control and reliable data
delivery is well under way [3, 7]. However, deployment of multicast applications lags behind, in large part
because of a lack of debugging and monitoring tools. Recently, several promising approaches and protocols
have been proposed for the purpose of aiding the network manager or the multicast application designer in
this task. These include the use of end-to-end measurements for inferring internal behavior on a multicast
tree [1], the development of the multicast route monitor (MRM) protocol [2], and a number of promising
fault monitoring tools [10, 13]. All of these address the problem of identifying performance and/or fault
behavior on a single multicast tree.

* This work is sponsored in part by the DARPA and Air Force Research Laboratory under agreement F30602-98-2-0238.

Although considerable progress has been made in developing tools for a single tree, little attention has
been paid to how to apply these tools to monitor an entire network, or even a subset of the network. We
address this problem; namely, given a set of links whose behavior is of interest, how does one choose a set
of minimum cost multicast trees within the network on which to apply these tools so as to determine the
behavior of the links in question? Resolution of this problem is especially important as poorly designed sets
of measurements can easily overwhelm network resources. The choice of trees, of course, is determined by
the multicast routing algorithm. This raises a related question, namely, does the multicast routing algorithm
even allow a set of trees that will allow one to determine the behavior of the links of interest? We
refer to this latter problem as the Multicast Tree Identifiability Problem (MTIP) and the first problem as the
Minimum cost Multicast Tree Cover Problem (MMTCP).

We refer to the behavior measurement of a link as a link measure. Note that solutions to MTIP and
MMTCP depend on details of the mechanism used to determine the link measures. Consequently, we
introduce three versions of these problems, the weak, strong, and medium cover problems. Briefly, the weak
cover problem is based on the assumption that it is sufficient that each link of interest appear in at least
one tree. The strong cover problem requires that each link occur between two branching points in at least
one tree. The medium cover problem relaxes this last requirement and instead requires that the set of trees
covering the link provide enough information to determine the link measure of interest.
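
The weak and strong cover conditions can be sketched in a few lines. The sketch rests on assumptions that should be read as such: a tree is represented as a set of directed (parent, child) links, and, following the MINC example discussed later in this section, the source and the receivers are counted as branching points alongside nodes with two or more children. The function names are hypothetical, and the layout of the second example tree is a guess, since Figure 1(a) is not reproduced here.

```python
def branching_points(tree):
    """Nodes that end a measurable segment of `tree`: the source, the
    receivers (leaves), and every node with two or more children."""
    children = {}
    has_parent = set()
    nodes = set()
    for u, v in tree:
        children.setdefault(u, []).append(v)
        has_parent.add(v)
        nodes.update((u, v))
    return {
        n for n in nodes
        if n not in has_parent          # source
        or n not in children            # receiver
        or len(children[n]) >= 2        # branch node
    }


def weakly_covered(link, trees):
    """True if the link appears in at least one tree."""
    return any(link in tree for tree in trees)


def strongly_covered(link, trees):
    """True if, in at least one tree, both endpoints of the link are
    branching points of that tree."""
    for tree in trees:
        if link in tree:
            points = branching_points(tree)
            if link[0] in points and link[1] in points:
                return True
    return False


# Tree (b): node 2 sends to 6 and 7. Tree (c): node 1 sends to 5 and 6;
# its exact layout is assumed for illustration.
TREE_B = {(2, 4), (4, 6), (4, 7)}
TREE_C = {(1, 3), (3, 5), (3, 4), (4, 6)}

if __name__ == "__main__":
    for link in [(2, 4), (4, 6), (4, 7), (3, 4)]:
        print(link,
              weakly_covered(link, [TREE_B, TREE_C]),
              strongly_covered(link, [TREE_B, TREE_C]))
```

The medium cover condition is not checked here, because it depends on which combinations of segment measurements determine the link measure of interest.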

Briefly,  the  paper  makes  the  following  contributions. 

•  We establish efficient algorithms for solving the identifiability problems. These are sufficiently
general to apply to unicast-based end-to-end measurements as well, e.g., [11].

•  We establish that the cover problems are NP-hard and that in some cases, finding an approximation
within a certain factor of optimal is also NP-hard. Thus, we also propose several heuristics and show
through simulation that a greedy heuristic that iteratively combines trees containing a small number
of receivers performs reasonably well.

•  We provide polynomial time algorithms that find optimal solutions for a restricted class of network
topologies, including trees. This algorithm can be used to provide a heuristic for sparse, tree-like
networks. This heuristic is also shown through simulation to perform well.

•  We apply our techniques to the vBNS and Abilene networks, examining the effectiveness of the
different heuristics and the sensitivity of the costs to the choice of routing algorithm.




To motivate the need for different cover problems, let us focus on the MINC approach for inferring
link-level loss behavior [1]. Given a sender and two or more receivers, there exist unbiased estimates of
the packet loss probabilities on each segment of the tree, whether it lies between source and branch point,
branch points, or branch point and receiver. Consider the simple topology illustrated in Figure 1(a). Assume
that only the two trees illustrated in Figure 1(b) and (c) are available to cover links. Suppose that we are
interested in the loss behavior of the set of links {(2, 4), (4, 6), (4, 7)}. In this case, applying the MINC
technique to the tree in 1(b) will produce estimates for the loss rates for those links. This is an example
where the tree provides a strong cover for the links of interest. Suppose that we are now interested in link
(3, 4). Neither tree provides a strong cover for this link. However, the tree in 1(b) allows us to determine the
loss rate for link (4, 6) and the tree in 1(c) allows us to determine the loss rate of path (3, 6). If we assume
that loss events are independent between links, then these loss rates can be used to compute the loss rate for
link (3, 4). Thus, the two trees provide a medium cover of the set of links {(2, 4), (4, 6), (4, 7), (3, 4)}.


Figure 1: A simple topology (a) with two possible trees: (b) where 2 sends to 6 and 7, and (c) where 1 sends
to 5 and 6.


A weak cover may suffice to provide a warning that some link is causing a problem. This can trigger
more bandwidth- and computation-intensive diagnostics to determine which link is causing the problem.

A number of end-to-end measurement techniques have been developed for network tomography. In addition
to the MINC approach, which applies to multicast, there are several promising unicast-based techniques.
For example, [5, 8] attempt to emulate multicast through the use of packet pairs so that multicast-based
analysis techniques can be applied. An algebraic approach has been proposed in [11] which can be used to infer
link round trip times based on path round trip time measurements. However, none of these deal with the
problem of designing measurements so as to minimize the overhead imposed on the network. We believe
that our techniques can be adapted to be used with these measurement methodologies.

The  remainder  of  the  paper  proceeds  as  follows.  Section  2  presents  the  model  for  MTIP  and  MMTCP, 
as  well  as  the  three  types  of  covers  we  consider.  Section  3  describes  the  efficient  algorithms  for  solving 
MTIP.  Section  4  introduces  several  approximation  algorithms  and  heuristics  for  MMTCP.  In  Section  5  we 
present  efficient  algorithms  that  find  the  optimal  MMTCP  solution  for  the  special  case  where  the  underlying 
network topology is a tree. Section 6 presents the results of simulation experiments on the vBNS and
Abilene networks as well as randomly generated networks. Last, Section 7 concludes the paper.

2  Model  and  Assumptions 

We represent a network by a directed graph $N = (V(N), E(N))$, where $V(N)$ and $E(N)$ denote the sets
of nodes and links within $N$, respectively. When unambiguous, we will omit the argument $N$. Our interest
is in multicast trees embedded within $N$. Let $S \subseteq V(N)$ be a set of possible multicast senders, and let
$R \subseteq V(N)$ be a set of possible multicast receivers. Let $T = (V(T), E(T))$ denote a directed (multicast)
tree with a source $s(T)$ and a set of leaves $r(T)$. We require that $s(T) \in S$ and $r(T) \subseteq R$. Let $A$ be a
mapping that takes a source $s \in S$ and receiver set $r \subseteq R$ and returns a tree $A(s, r)$. In the context of a
network, $A$ corresponds to the multicast routing algorithm. Examples include DVMRP [9], and PIM-DM
and PIM-SM [6]. Let $\mathcal{T}(A, S, R) = \{A(s, r) : s \in S, r \subseteq R \setminus \{s\}\}$, i.e., $\mathcal{T}(A, S, R)$ is the set of all possible
multicast trees that can be embedded in $N$ using multicast routing algorithm $A$. We shall henceforth denote
$\mathcal{T}(A, S, R)$ by $\mathcal{T}(S, R)$, omitting the dependence on $A$.

We now associate a cost with a multicast tree $T \in \mathcal{T}(S, R)$. We assume that it can be expressed as

$$C(T) = C^0(T) + \sum_{l \in E(T)} C_l \qquad (1)$$

where the first term can be thought of as a "per tree cost" and the second is a "per link cost". The two
problems of interest to us are as follows:

Multicast Tree Identifiability Problem. Given a set of multicast trees $\mathcal{T}' \subseteq \mathcal{T}(S, R)$, and a set of links
$L \subseteq E$, is $L$ identifiable by the set of trees $\mathcal{T}'$?

Minimum cost Multicast Tree Cover Problem. Given $S, R \subseteq V$ and $L \subseteq E$, what is the minimum cost
subset of $\mathcal{T}(S, R)$ sufficient to cover $L$? In other words, find $\mathcal{T}' \subseteq \mathcal{T}(S, R)$ that covers $L$ and minimizes

$$C(\mathcal{T}') = \sum_{T \in \mathcal{T}'} C(T)$$

We distinguish three types of solutions to both of these problems. These solutions differ in what exactly
is meant by a cover. We say that a node $v$ is a branch point in tree $T$ if $v$ is either a root or a leaf, or $v$ has
more than one child. A path $l = (v_1, v_2, \ldots, v_n)$ is said to be a logical link within $T$ if $v_1$ and $v_n$ are branch
points, $v_2, \ldots, v_{n-1}$ are not, and $(v_i, v_{i+1}) \in E(T)$, $i = 1, \ldots, n - 1$.

• Strong cover: Given a set of trees $\mathcal{T}'$, $\mathcal{T}'$ is the strong cover of link $l = (u, v)$ if there exists a $T \in \mathcal{T}'$
such that both $u$ and $v$ are branch points in $T$. $\mathcal{T}'$ is the strong cover of a link set $L$ if $\forall l \in L$, $\mathcal{T}'$ is the
strong cover of $l$.

• Medium cover: Given a set of trees $\mathcal{T}'$, let $L^*$ be the set of all paths for which $\mathcal{T}'$ is the strong cover,
and let $M^*$ be a set of observed link measures for each $l^* \in L^*$. We say that $\mathcal{T}'$ is the medium cover
of link $l$ if any observed $M^*$, resulting from a situation where successful transmission over each path
is possible, uniquely determines the link measure for $l$. We say that $\mathcal{T}'$ is the medium cover of a link
set $L$ if $\forall l \in L$, $\mathcal{T}'$ is the medium cover for $l$.

• Weak cover: Given a set of trees $\mathcal{T}'$, $\mathcal{T}'$ is the weak cover of link $l$ if there exists a $T \in \mathcal{T}'$ such that
$l \in E(T)$. We say that $\mathcal{T}'$ is the weak cover of a link set $L$ if $\forall l \in L$, $\mathcal{T}'$ is the weak cover of $l$.
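
The weak and strong cover definitions can be checked mechanically. The following is a minimal sketch, not the authors' implementation. It assumes each tree is given as a `(root, edge_set)` pair, and it additionally requires that the link itself appear in the tree for a strong cover, which the definition leaves implicit. The two example trees follow the description of Figure 1(b) and (c); because the figure itself is not reproduced here, the edges (1,3) and (3,5) are assumptions.

```python
from collections import defaultdict


def branch_points(root, edges):
    """Branch points of a directed tree: the root, the leaves,
    and every node with more than one child."""
    children = defaultdict(list)
    nodes = {root}
    for u, v in edges:
        children[u].append(v)
        nodes.update((u, v))
    # 0 children -> leaf, more than 1 child -> branching node
    return {n for n in nodes if n == root or len(children[n]) != 1}


def weak_cover(trees, link):
    """True if the link appears in at least one tree."""
    return any(link in edges for _, edges in trees)


def strong_cover(trees, link):
    """True if some tree contains the link with both endpoints as branch points."""
    u, v = link
    for root, edges in trees:
        if link in edges:
            bp = branch_points(root, edges)
            if u in bp and v in bp:
                return True
    return False


# Assumed reading of Figure 1: (b) 2 sends to 6 and 7; (c) 1 sends to 5 and 6.
tree_b = (2, {(2, 4), (4, 6), (4, 7)})
tree_c = (1, {(1, 3), (3, 5), (3, 4), (4, 6)})
trees = [tree_b, tree_c]

print(strong_cover(trees, (2, 4)))  # link between two branch points of tree (b)
print(strong_cover(trees, (3, 4)))  # node 4 has a single child in tree (c)
print(weak_cover(trees, (3, 4)))    # the link does appear in tree (c)
```

Under these assumptions, link (3,4) is weakly but not strongly covered, which matches the motivating example in the introduction.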

We refer to the problems of finding these types of solutions as S-MTIP/S-MMTCP, W-MTIP/W-MMTCP
and M-MTIP/M-MMTCP respectively. Several cases are of interest to us. One is where $L = E$, i.e., where
the objective is to cover the entire network. A second is where $L$ consists of one link, $|L| = 1$. If, $\forall l$, we set
$C(l) = 0$, the problem becomes that of covering the link set $L$ with the set of trees with minimum total per
tree cost.

3  The  Multicast  Tree  Identifiability  Problem 

In this section, we consider the identifiability problem associated with the weak, strong, and medium cover
problems. In the case of weak covers, the identifiability problem is straightforward. It suffices to check
whether each link in $L$ appears in one of the trees within $\mathcal{T}'$. The identifiability problem is also
straightforward in the case of strong covers. In this case, $L$ is identifiable iff each link within $L$ lies between branch
points in at least one tree within $\mathcal{T}'$.

The identifiability problem for medium covers is more interesting and has application to a much larger
set of network characterization problems than the one being considered in this paper, e.g., [11]. Unlike the
weak and strong cover problems, we have results only for summable link measures. Fortunately, this is not
restrictive, as such measures include loss probabilities, average delays, delay variances, and others.

We say that a link measure is summable if, for a path $p$ consisting of the set of links $P$,

$$\beta_p = \sum_{l \in P} m_l \qquad (2)$$

where $\beta_p$ and $m_l$ are the link measures for the path $p$ and the link $l$, respectively. We also require that
$m_l$ be finite if it is possible to transmit across link $l$. Expected link delay is an example of a summable
link measure.^1 The logarithm of the probability of successful transmission is summable provided that loss
events are independent from link to link. Note that if it is possible to transmit on a link, then this link
measure is finite. Given the logarithm of the success probability, we can of course compute the loss or
success probability.

^1 Note that for link delay, "possible to transmit" means the expected delay is finite.
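
As a small worked illustration of summability, the sketch below recovers the loss rate of one link on a path from the path loss rate and the loss rates of the remaining links, by subtracting logarithms of success probabilities. It assumes independent losses, as stated above. The function name and the numeric loss rates are hypothetical and chosen only for illustration.

```python
import math


def link_loss_from_path(path_loss, other_link_losses):
    """Loss rate of the single unknown link on a path.

    Works in the log domain, where the success probability is summable:
    log(1 - path_loss) = sum of log(1 - link_loss) over the links of the path.
    Assumes independent losses and loss rates strictly below 1.
    """
    log_path = math.log(1.0 - path_loss)
    log_known = sum(math.log(1.0 - p) for p in other_link_losses)
    return 1.0 - math.exp(log_path - log_known)


# Illustrative numbers: path (3,6) loses 6.9% of packets and link (4,6) loses 5%,
# so link (3,4) must account for the rest.
loss_34 = link_loss_from_path(0.069, [0.05])
print(round(loss_34, 6))
```

With these numbers the path success probability 0.931 factors as 0.98 times 0.95, so the recovered loss rate of link (3,4) is about 0.02.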

We focus now on the identifiability problem associated with the medium cover problem. Let $T \in \mathcal{T}'$. Define a
segment in $T$ to be a path between either the root of $T$ and the closest branch point, two neighboring branch
points, or a branch point and a leaf of $T$. Let $\mathcal{S}$ be the set of all segments within the trees contained in $\mathcal{T}'$.
Each segment $p \in \mathcal{S}$ is a path within $N$. The fundamental assumption underlying our work is that $\mathcal{T}'$ allows
us to obtain $\beta_p$, for all $p \in \mathcal{S}$, i.e., $\mathcal{S}$ is identifiable from $\mathcal{T}'$. Note also that $\mathcal{S}$ corresponds to a set of logical
links that corresponds to the strong cover of $\mathcal{T}'$. Now, (2) holds for all $p \in \mathcal{S}$. If we introduce the $|\mathcal{S}| \times |E|$
matrix $B$ where $B_{ij} = 1$ if link $j$ belongs to segment $i$ and 0 otherwise, then we have the following matrix
equation

$$B m = \beta \qquad (3)$$

where the components of $m$ are $m_l$ and the components of $\beta$ are $\beta_p$. We state without proof the following
identifiability result.

Theorem 1 Let $\mathcal{T}'$ be a set of multicast trees. Provided that the link measures are summable, $\mathcal{T}'$ identifies
the link measures $m_l$, $l \in L$ iff for $\beta$ such that the components correspond to the number of links in the
segment, there is a unique set $\{m_l, l \in L\}$ that satisfies equation (3).

Determination of whether $L$ is identifiable or not can be efficiently performed. Choose $\beta_k$ to be the
number of links in segment $k \in \mathcal{S}$. Apply Gaussian elimination to $B$ to obtain its reduced row echelon form
matrix $B'$. $L$ is identifiable if $\forall l \in L$, there is some row in $B'$ where the column corresponding to link $l$ is
the only non-zero value. This corresponds to the existence of a unique set $\{m_l, l \in L\}$ satisfying (3).
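
The row-reduction test can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: segments are given as sets of links, arithmetic is exact (rationals), and the helper names `rref` and `identifiable_links` are hypothetical. The example reuses the two trees of Figure 1, with edges (1,3) and (3,5) assumed because the figure is not reproduced here.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form using exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    pivot_row = 0
    for col in range(ncols):
        if pivot_row == len(m):
            break
        # Find a row at or below pivot_row with a non-zero entry in this column.
        pr = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        pv = m[pivot_row][col]
        m[pivot_row] = [x / pv for x in m[pivot_row]]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                f = m[r][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return m


def identifiable_links(segments, links):
    """Links whose column is the only non-zero entry of some row of the RREF of B."""
    B = [[1 if l in seg else 0 for l in links] for seg in segments]
    found = set()
    for row in rref(B):
        nonzero = [j for j, x in enumerate(row) if x != 0]
        if len(nonzero) == 1:
            found.add(links[nonzero[0]])
    return found


links = [(1, 3), (2, 4), (3, 4), (3, 5), (4, 6), (4, 7)]
segments_b = [{(2, 4)}, {(4, 6)}, {(4, 7)}]          # tree (b): 2 -> 4 -> {6, 7}
segments_c = [{(1, 3)}, {(3, 5)}, {(3, 4), (4, 6)}]  # tree (c): 1 -> 3 -> {5, 4 -> 6}

print(sorted(identifiable_links(segments_c, links)))               # tree (c) alone
print(sorted(identifiable_links(segments_b + segments_c, links)))  # both trees
```

With tree (c) alone, link (3,4) shares a row with link (4,6) and is not identifiable. Adding tree (b) isolates link (4,6) and therefore link (3,4) as well, which is the medium-cover situation of the introductory example.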

4  Approximating  the  Minimum  cost  Multicast  Tree  Cover  Problem 

We have defined three types of Multicast Tree Cover Problems: the S-MMTCP, the W-MMTCP, and the
M-MMTCP. Unfortunately, as the following theorem shows, not only are these problems NP-hard,
we cannot even expect to find a good quality approximation to them. In particular, we can
demonstrate the following (the proof is deferred to the full version of the paper):

Theorem 2 For each of S-MMTCP, W-MMTCP, and M-MMTCP, it is NP-hard to find a solution that is
within a factor of $(1 - \epsilon) \ln |L|$ of the optimal solution, for any $\epsilon > 0$. These problems remain NP-hard
even with the restriction that $\forall l$, $C(l) = 0$.

Since we cannot expect to solve these problems exactly, in the remainder of this section, we focus on
approximation algorithms and heuristics for good solutions. In Section 4.1, we focus on approximation
algorithms for the case where the goal is to minimize the total cost of setting up the multicast trees. We
provide polynomial time algorithms that have a provable bound on how close to optimal the resulting
solution is for the S-MMTCP and the W-MMTCP. In Section 4.2, we describe extensions to these algorithms for
the problem of approximating the general MMTCP. The resulting algorithms only run in polynomial time
when the number of possible receivers is $O(\log |E|)$, and thus, to approximate the general MMTCP more
efficiently, we also propose a heuristic that always finds a solution to the general MMTCP in polynomial
time. In Section 6, we experimentally verify the quality of the solutions found by this heuristic.

4.1 Minimizing the total per-tree cost

We first study how to approximate the MMTCP when the goal is only to minimize the total cost for setting
up the multicast trees but not the cost for multicast traffic to travel links, i.e., $C_l = 0$ in (1). This problem
is simpler, since without a cost for link traffic, if a sender is performing a multicast, there is no additional
cost for sending to every receiver. Thus, we can assume that any active sender multicasts to every possible
receiver. Note, however, that by Theorem 2, even this special case is NP-hard to approximate within better
than a $\ln |L|$ factor. We describe algorithms for this problem that achieve exactly this approximation ratio.
These algorithms rely on the fact that when $C(l) = 0$, $\forall l$, both the W-MMTCP and the S-MMTCP can
be solved using algorithms for the weighted Set-Cover problem, which is defined as follows:

• The weighted Set-Cover problem: given a finite set $B = \{e_1, e_2, \ldots, e_m\}$ and a collection of subsets
of $B$, $\mathcal{F} = \{B_1, B_2, \ldots, B_n\}$, where each $B_i \subseteq B$ is associated with a weight $w_i$, find the minimum
total weight subset $\mathcal{F}' \subseteq \mathcal{F}$ such that each $e_i \in B$ is contained in some $B_j \in \mathcal{F}'$.

To use algorithms for the weighted Set-Cover problem to solve the S-MMTCP or W-MMTCP, we simply
set $B = L$, $B_i = \{l \in L$ such that $l$ is in the cover (strong or weak, respectively) produced by $T_i$, where
$T_i$ is the tree produced by sender $i$ multicasting to every receiver$\}$. The weight of $B_i$ is the per-tree cost
of multicasting from sender $i$. Any solution to the resulting instance of the weighted Set-Cover problem
produces an S-MMTCP (W-MMTCP, resp.) solution of the same cost.^2 Using this idea, we introduce two
algorithms for the MMTCP: a greedy algorithm modeled after a weighted Set-Cover algorithm analyzed by
Chvátal [4], and an algorithm that uses 0-1 integer programming, constructed using a weighted Set-Cover
algorithm analyzed by Srinivasan [12].

Greedy algorithm: The intuition behind the greedy algorithm is simple. Assume first that the per-tree cost
is the same for every multicast tree. In this case, the algorithm, at every step, chooses the multicast tree that
covers the most remaining uncovered links. This is repeated until the entire set of links is covered. When
different trees have a different per-tree cost, then instead of maximizing the number of new links covered,
the algorithm maximizes the number of new links covered, divided by the cost of the tree. Intuitively, this
maximizes the "profit per unit cost" of the new tree that is added.

^2 Note that this technique does not work for the case of a medium cover, since the set of links in the medium cover of a tree $T_i$
will depend on what other trees are also being used.

The details of the algorithm are shown in Figure 2. This algorithm is easily seen to run in polynomial
time for all three types of covers. For the W-MMTCP and S-MMTCP, Theorem 3 provides a bound on how
good an approximation the algorithm produces. This algorithm can also be used to solve the M-MMTCP.
We verify the quality of the approximation obtained for the M-MMTCP experimentally.

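1. Apply the multicast routing protocol $A$ to compute the set $\mathcal{T}$ of all multicast
   trees from a source in $S$ to every receiver in $R$. $\forall T_i \in \mathcal{T}$, set its cost $c_i = C^0(T_i)$.

2. Set $\mathcal{J} = \mathcal{T}$, $J = \emptyset$, $L' = \emptyset$.

3. If $L' = L$, then stop and output $J$.

4. $\forall T_i \in \mathcal{J} \setminus J$, set $L_i = Cov(J \cup \{T_i\}, COVER)$. Find $T_i \in \mathcal{J} \setminus J$ that maximizes $|(L_i \cap L) \setminus L'| / c_i$.

5. $L' = L' \cup (L_i \cap L)$, $J = J \cup \{T_i\}$. Go to step 3.


Figure 2: The greedy algorithm to approximate MMTCP.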
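
The greedy selection rule of Figure 2 can be sketched as follows for the weak and strong cover cases, where the links covered by a tree do not depend on which other trees are chosen. This is a simplified illustration, not the authors' implementation: the function name `greedy_cover`, the dictionary layout, and the example costs are hypothetical, and costs are assumed to be strictly positive.

```python
def greedy_cover(required, candidates):
    """Greedy weighted set cover.

    required:   set of links that must be covered (the set L).
    candidates: dict mapping a tree name to (cost, set of links it covers).
    Returns the list of chosen tree names, in the order they were picked.
    """
    uncovered = set(required)
    chosen = []
    while uncovered:
        best, best_ratio = None, 0.0
        for name, (cost, covered) in candidates.items():
            if name in chosen:
                continue
            gain = len(covered & uncovered)  # newly covered required links
            if gain and gain / cost > best_ratio:
                best, best_ratio = name, gain / cost
        if best is None:
            raise ValueError("required links cannot be covered by the candidates")
        chosen.append(best)
        uncovered -= candidates[best][1]
    return chosen


required = {"a", "b", "c", "d", "e"}
candidates = {
    "T1": (1.0, {"a", "b", "c"}),
    "T2": (1.0, {"c", "d"}),
    "T3": (1.0, {"d", "e"}),
    "T4": (3.0, {"a", "b", "c", "d", "e"}),
}
picked = greedy_cover(required, candidates)
print(picked)
```

In this example the first pick is T1 (three new links per unit cost) and the second is T3 (two new links per unit cost), for a total cost of 2 against a cost of 3 for the single tree T4. For the medium cover, the covered set would have to be recomputed for the whole chosen set at every step, as step 4 of Figure 2 does through Cov.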


Theorem 3 For any instance $I$ of S-MMTCP or W-MMTCP with $C(l) = 0$, $\forall l$, the greedy algorithm finds
a solution of cost at most $(\ln d + [\ldots]) \cdot \mathrm{OPT}$ (the additive terms following $\ln d$ are illegible in the source), where $d = \min(|L|, \max_i |L_i|)$ and OPT is the cost
of the optimal solution to $I$.

We see from Theorems 2 and 3 that the performance of the greedy algorithm is the best that we can
hope to achieve. However, these theorems only apply to the worst case performance; for the average case,
the performance may be much better, and the best algorithm may be something completely different. We
investigate this issue further by introducing a second approximation algorithm, based on 0-1 integer
programming. We shall see in Section 6 that the 0-1 integer programming algorithm performs better than the
greedy algorithm in some cases. The details of this algorithm are omitted from this version of the paper; we
here only state the following theorem that demonstrates how good a solution is provided by this approach.

Theorem 4 For any instance $I$ of either the S-MMTCP or the W-MMTCP, the 0-1 linear programming
algorithm finds a solution of cost at most $\mathrm{OPT}\,(1 + O(\max\{\ln(m/\mathrm{OPT}), \sqrt{\ln(m/\mathrm{OPT})}\}))$, where OPT
is the cost of the optimal solution to $I$.

We also point out that in the Set-Cover problem, if we let $d = \max_j |B_j|$, then even for $d = 3$, the
set cover problem is still NP-hard. However, it can be solved in polynomial time provided that $d = 1$ or
$d = 2$. Since we can transform the S-MMTCP and the W-MMTCP to Set-Cover problems, we know that the
S-MMTCP and the W-MMTCP can be solved in polynomial time given that $\max_i |L \cap E(T_i)| \le 2$, where
$T_i$ is the tree produced by sender $i$ multicasting to every receiver.
4.2 Minimizing the total cost

We next look at the general MMTCP, where the goal is to minimize the total cost. In addition to the per-tree
cost of the multicast trees used in the cover, this includes the cost of multicast traffic traveling on the links
used by the trees. Both the greedy algorithm and the 0-1 integer programming algorithm can be extended to
approximate the general problem. For the greedy algorithm, we simply replace the first step with:

1. Apply the multicast routing protocol $A$ to compute the set $\mathcal{T}$ of all multicast
   trees from a source in $S$ to any subset of $R$. $\forall T_i \in \mathcal{T}$, compute its cost $c_i = C(T_i)$.

The bound from Theorem 3 on the quality of approximation achieved by the greedy algorithm also
applies to this more general algorithm. However, the greedy algorithm has a running time that is polynomial
in $|\mathcal{T}|$. For the algorithm of Section 4.1, $|\mathcal{T}| = |S|$, which results in a polynomial time algorithm, but for
the more general algorithm considered here, $|\mathcal{T}| = |S| \cdot (2^{|R|} - 1)$. Thus, the more general approximation
algorithm only has a polynomial running time when $|R| = O(\log n)$, where $n$ is the size of the input to the
MMTCP problem (i.e., the description of $N$, $S$, $R$ and $L$). The analogous facts also apply to the 0-1 integer
programming algorithms.

In order to cope with large values of $|R|$ in the general MMTCP, we also introduce the fast greedy
heuristic, which always runs in polynomial time. Fast greedy is like the greedy algorithm, except that
instead of considering all possible multicast trees (i.e., every tree from a sender to a subset of the receivers),
it restricts itself to only those trees that contain 3 receivers (or, in the case of a weak cover, 1 receiver). There
will be at most a polynomial number of such trees. Fast greedy then uses the greedy strategy to choose a
subset of these trees covering all required links, and then merges the trees with the same sender. The details
of this heuristic are described in Figure 3. We shall see in Section 6 that the performance of the fast greedy
heuristic is often close to that of the greedy algorithm.

5  Finding  the  optimal  solution  in  tree  topologies 

We saw in Theorem 2 that we cannot hope to find an efficient algorithm that solves any version of the
MMTCP in general. However, this does not rule out the possibility that it is possible to solve these problems
efficiently on certain classes of network topologies. In this section, we study the MMTCP in the case that the
underlying network topology $N$ is a tree. This is motivated by the hierarchical structure of real networks,
which can be thought of as having a tree topology with a small number of extra edges. We shall see in
Section 6 that algorithms for the tree topology can be adapted to provide a very effective heuristic for such
hierarchical topologies. We use this heuristic to provide good solutions to the W-MMTCP problem for the
topologies of the vBNS network, as well as the Abilene network.
1. If COVER = 'strong' or 'medium', apply the multicast routing protocol $A$ to
   compute the set $\mathcal{T}$ of all multicast trees that have one sender in $S$ and three
   receivers in $R$.

   If COVER = 'weak', apply the multicast routing protocol $A$ to compute the set $\mathcal{T}$
   of all multicast trees (paths) that have one sender in $S$ and one receiver in $R$.
   $\forall T_i \in \mathcal{T}$, compute its cost $c_i = C(T_i)$.

2. For all $T_i \in \mathcal{T}$, set $E_i = E(T_i)$.

3. Set $\mathcal{J} = \mathcal{T}$, $J = \emptyset$, $L' = \emptyset$.

4. If $L' = L$, then aggregate all the trees in $J$ that share the same source node
   and output $J$. Stop.

5. For all $T_i \in \mathcal{J} \setminus J$, set $X_i = J \cup \{T_i\}$ and aggregate all the trees in $X_i$ sharing a
   common source as one tree. Set $L_i = Cov(X_i, COVER)$.

   Find $T_i \in \mathcal{J} \setminus J$ that maximizes $|(L_i \cap L) \setminus L'| / c_i$.

6. $L' = L' \cup (L_i \cap L)$, $J = J \cup \{T_i\}$. For each $T_j \in \mathcal{J} \setminus J$, if $T_j$ shares the same source with
   $T_i$, $E_j = E_j \setminus E_i$, $c_j = C(E_j)$. Go to step 4.

Figure 3: Fast greedy heuristic to approximate MMTCP.


Our algorithm for the tree topology is guaranteed to find the optimal solution in polynomial time. We
shall describe this algorithm for the (easier) case of the W-MMTCP. In order to describe this algorithm
more concisely, we shall make some simplifying assumptions. In particular, we assume that $N$ is a rooted
binary tree, that the per tree cost of every multicast tree is zero, that $C((a, b)) = C((b, a))$ and that the
cover requirement on a link can be satisfied from either direction. For the W-MMTCP, these assumptions
can be removed by making the algorithm slightly more complicated, but without significantly increasing the
running time of the algorithm. For the S-MMTCP, all of the assumptions can be removed except for the
assumption that $N$ is a binary tree.

The algorithm, based on dynamic programming, starts by creating a table, with one row in the table for each
link $l$ of the tree, and $|S|^2$ entries in each row, labeled $C_{\langle l,i,j \rangle}$ for $0 \le i, j \le |S|$. For link $l$ connecting
nodes $u$ and $v$, where $u$ is closer to the root, let $ST_l$ be the subtree rooted at node $v$, together with link $l$ and
node $u$. The value computed for entry $C_{\langle l,i,j \rangle}$ is the minimum possible total cost for the tree $ST_l$ (removed
from the rest of the network), subject to the following conditions: all of the links that are required to be
covered in $N$ are covered in $ST_l$, $u$ is a source that generates $j$ multicast sessions that are routed across $l$,
and $u$ is also a receiver that receives $i$ multicast sessions. If there are fewer than $i$ senders in $ST_l - u$, or $j > 0$
and there are no receivers in $ST_l - u$, then we call the pair $(i, j)$ invalid for link $l$, and the value of entry
$C_{\langle l,i,j \rangle}$ is set to infinity.

We compute the values in the table one row at a time, in decreasing order of the distance of the
corresponding links from the root. When $l$ is connected to a leaf of the tree $N$, it is straightforward to compute
$C_{\langle l,i,j \rangle}$ for all $i, j$, since if $(i, j)$ is valid, then $C_{\langle l,i,j \rangle} = (i + j) C_l$. We now show how to compute the
remaining entries $C_{\langle l,i,j \rangle}$ for a link $l$ connecting nodes $u$ and $v$, where $u$ is closer to the root, and $v$ is
connected to two links $m$ and $n$, as depicted in Figure 4. Since $m$ and $n$ are further from the root than $l$, we
can assume that $C_{\langle m,i_m,j_m \rangle}$ and $C_{\langle n,i_n,j_n \rangle}$ have already been computed, for $0 \le i_m, j_m, i_n, j_n \le |S|$.

We see that $C_{\langle l,i,j \rangle} = \min(C_{\langle m,i_m,j_m \rangle} + C_{\langle n,i_n,j_n \rangle} + (i + j) \cdot C(l))$, where the minimum is taken
over all $i_m, j_m, i_n, j_n$ that provide valid multicast flows through the node $v$. Which values of the flows
are valid is checked using an algorithm described in Figure 14. For space reasons, this algorithm has been
deferred to the appendix. Along with the value $C_{\langle l,i,j \rangle}$, we store the values of $i_m$, $j_m$, $i_n$ and $j_n$ that
resulted in the minimum $C_{\langle l,i,j \rangle}$. Call these values the optimal indices for $C_{\langle l,i,j \rangle}$. If $(i, j)$ is invalid for
link $l$, $C_{\langle l,i,j \rangle}$ is set to infinity. Also, if link $l$ is in the to-be-covered set of links, $C_{\langle l,0,0 \rangle}$ is set to $\infty$. By
proceeding in this fashion from the leaves to the root, we see that we can fill in the entire table.

To complete the algorithm, we attach a virtual link $x$ to the root of $N$ with $C(x) = 0$, and use the same
technique to compute $C_{\langle x,0,0 \rangle}$. The minimum cost for covering the given set of required links in $N$ is
$C_{\langle x,0,0 \rangle}$. In order to find the actual multicast trees, we first follow the stored optimal indices from $C_{\langle x,0,0 \rangle}$
to the leaves of the tree to determine the actual optimal number of flows in either direction on each link of
the tree. Given this information, a simple greedy algorithm finds a set of multicast trees that results in this
number of flows. The description of this greedy algorithm is left to the full version of the paper. We present
the details of our algorithm, which we call Tree-Optimal, in Figure 5.

Theorem 5 The algorithm Tree-Optimal finds the optimal cost of a solution to the W-MMTCP in any
binary tree in $O(|E| \, |S|^{k})$ steps (the exponent $k$ is illegible in the source).

The proof is provided in the appendix. Following a similar idea, we can also construct an algorithm for
solving the S-MMTCP. However, the strong cover requirements make that algorithm somewhat more
complicated than the algorithm presented here.

In  order  to  deal  with  the  case  that  N  is  a  tree  of  arbitrary  degree  (instead  of  just  a  binary  tree),  we 
transform  N  into  a  binary  tree  N'.  To  do  so,  choose  any  node  to  be  the  root  of  the  tree  N.  Then,  given 
a  node  r  with  n  children,  create  a  virtual  node  r',  make  it  the  child  of  node  r  and  assign  zero  cost  to  the 
virtual  link  {r,r').  Then  pick  any  n  —  1  other  links  attached  to  node  r,  and  move  them  to  node  r'.  We 
repeat  this  process  until  all  nodes  are  binary;  the  resulting  tree  is  N'.  Since  the  cost  of  traveling  any  virtual 
link  is  zero,  and  no  virtual  link  is  in  the  set  of  to  be  covered  links,  the  cost  of  an  optimal  solution  for  the 
topology  N  is  the  same  as  the  optimal  cost  for  the  topology  N'.  Thus,  we  can  find  the  optimal  solution 
for  N  by  transforming  N  into  the  topology  N',  finding  an  optimal  solution  for  the  topology  N',  and  then 
transforming  the  solution  for  N'  into  an  equal  cost  solution  for  N. 
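
The binarization step can be sketched as follows. This is an illustrative version under stated assumptions rather than the authors' code: the tree is given as a dictionary from each node to its list of children, virtual nodes are named with hypothetical tuples of the form `("virtual", k)`, and the returned list of virtual links is what a caller would assign zero cost and exclude from the set of links to be covered.

```python
def binarize(children, root):
    """Turn a rooted tree of arbitrary degree into a binary tree.

    children: dict mapping a node to the list of its children
              (leaves may be missing from the dict).
    Returns (new_children, virtual_links). Each virtual link (r, r') is meant
    to carry zero cost and is never in the set of links to be covered.
    """
    out = {node: list(kids) for node, kids in children.items()}
    virtual_links = []
    counter = 0
    stack = [root]
    while stack:
        r = stack.pop()
        kids = out.get(r, [])
        if len(kids) > 2:
            counter += 1
            r_prime = ("virtual", counter)
            out[r_prime] = kids[1:]        # move n - 1 child links to r'
            out[r] = [kids[0], r_prime]    # r keeps one child plus r'
            virtual_links.append((r, r_prime))
            kids = out[r]
        stack.extend(kids)                 # r' is revisited and split again if needed
    return out, virtual_links


tree = {"r": ["a", "b", "c", "d"]}
binary_tree, virtual_links = binarize(tree, "r")
print(binary_tree)
print(virtual_links)
```

A node with four children is split twice, so the example produces two virtual links. Every original link is preserved, which is why a solution on the transformed tree maps back to an equal-cost solution on the original one.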

We can also deal with the general problem where each multicast tree has a per tree cost. We do this by
adding a virtual node $u$ and virtual link $(u, v)$ for each source candidate $v$, and replacing the source candidate
$v$ with the virtual node $u$. We then assign the cost for initializing the multicast tree that was rooted at node
$v$ as the cost of the link $(u, v)$. The problem then becomes that of finding the optimal cost covering without
per tree costs in the resulting network. If we also want to allow the cost of initializing different multicast
trees at the same source to vary, we can attach a different virtual node $u$ for each multicast tree rooted at $v$.


Figure 4: A link $l$ incident to the same node as links $m$ and $n$.


1. Add a virtual link $x$ whose downstream node is the root of the tree and assign
   zero cost for traveling link $x$.

2. Create a table $C$, with each row labeled with a link $l$, and each column
   labeled by $\langle i, j \rangle$, for $0 \le i, j \le |S|$. Initialize every entry in the table to $+\infty$.

3. For each link $l$, let $SU_l$ and $SD_l$ be the number of sources above and below $l$
   respectively. Let $RU_l$ and $RD_l$ be the number of receivers above and below $l$
   respectively.

4. $\forall l_i$, such that $l_i$ is a link attached to a leaf node:

   if ($SD_{l_i} == 1$) then $C_{\langle l_i,0,1 \rangle} = C(l_i)$

      if ($RD_{l_i} == 1$) then $C_{\langle l_i,j,k \rangle} = C(l_i) * (j + k)$, $\forall j, k$, $0 \le j \le SU_{l_i}$, $0 \le k \le 1$. endif

   else

      if ($RD_{l_i} == 1$) then $C_{\langle l_i,j,0 \rangle} = C(l_i) * j$, $0 \le j \le SU_{l_i}$ endif

   if ($l_i \in L$) then $C_{\langle l_i,0,0 \rangle} = +\infty$ else $C_{\langle l_i,0,0 \rangle} = 0$ endif

5. Choose any link $l_i$, such that row $l_i$ has not been computed, and $l_i$ is incident
   to a vertex that is also incident to links $l_j$ and $l_k$ such that rows $l_j$ and $l_k$ have
   been computed. Let $v$ be the node they share.

   $C_{\langle l_i,p_3,q_3 \rangle} = \min(C_{\langle l_j,p_2,q_2 \rangle} + C_{\langle l_k,p_1,q_1 \rangle} + C(l_i) * (p_3 + q_3))$,

   where $valid(p_3, q_3, p_2, q_2, p_1, q_1, v)$ is true.

   If ($l_i \in L$) then $C_{\langle l_i,0,0 \rangle} = +\infty$ endif

6. If all the links are done, then stop and return $C_{\langle x,0,0 \rangle}$. Else go to step 5.

Figure  5:  The  algorithm  Tree-Optimal.  This  algorithm  finds  the  eost  of  the  optimal  solution  to  the  W- 
MMTCP  on  a  binary  tree  topology.  The  funetion  valid  is  deseribed  in  the  appendix. 


6  Experiments  and  Findings 

In this section, we explore the effectiveness of the heuristics presented in the previous two sections. We
consider two settings. First, we consider the two existing Internet2 backbone networks, vBNS and Abilene,
where we study not only the effectiveness of our heuristics, but also the effect that the routing protocol has
on measurement cost. Second, as these networks are rather sparse, we also apply our techniques to a set of
denser, randomly generated graphs and assess their performance.

6.1 Internet2 networks


We  consider  the  two  Internet2  backbone  networks,  vBNS  [15]  and  Abilene  [16].  Both  networks  maintain 
native  IP  multicast  services  using  the  PIM  sparse-mode  routing  algorithm.  Since  the  experimental  results 
on the Abilene network are similar to those on vBNS, in this section we present only the experimental results
from vBNS. The vBNS multicast topology (as of October 25, 1999) is illustrated in [15]. It consists of 160
nodes and 165 edges; each node represents a router and each edge represents a link connecting a pair of
routers. The link bandwidths vary between 45 Mb/s (DS3) and 2.45 Gb/s (OC48). In our experiments, we assume
that the cost of using a link for measurement within one multicast session is inversely proportional to its
bandwidth. In addition, we assume that only the leaves in the topology (i.e., nodes of degree one) can be a
sender or a receiver.
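
The two modeling assumptions above (cost inversely proportional to bandwidth, and only leaves acting as
senders or receivers) can be illustrated as follows. This is a minimal sketch on a made-up four-router
topology; the router names, the 622 Mb/s link, and the scale constant are hypothetical, since the text fixes
the cost only up to a constant of proportionality.

```python
from collections import Counter


def link_costs(bandwidth_mbps, scale=1.0):
    """Cost of each link, inversely proportional to its bandwidth."""
    return {edge: scale / bw for edge, bw in bandwidth_mbps.items()}


def leaf_nodes(edges):
    """Nodes of degree one, i.e. the sender/receiver candidates."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(n for n, d in degree.items() if d == 1)


# Hypothetical topology: r2 is an internal router with three neighbors.
bandwidth = {("r1", "r2"): 45.0, ("r2", "r3"): 622.0, ("r2", "r4"): 45.0}
costs = link_costs(bandwidth, scale=45.0)   # normalize so a DS3 link costs 1
leaves = leaf_nodes(bandwidth)              # iterating a dict yields its edge keys
```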

6.1.1  Tree  heuristic 

In Section 5 we proposed the algorithm Tree-Optimal, which is guaranteed to find optimal solutions in
polynomial time for any tree topology. We propose and study a heuristic based on that algorithm. This heuristic
uses the observation that the topology of networks such as vBNS and Abilene is very close to a tree.
Furthermore, the bandwidth of the small number of links that create cycles tends to be high, and thus these
links presumably have low cost.

The heuristic can be applied to any topology, and proceeds in four phases. In the first phase, the topology
is converted into a tree by condensing every cycle into a single super-node. In the second phase, the
algorithm Tree-Optimal is run on the resulting tree. This gives us a set of multicast trees, defined by a list
of senders, and for each sender a list of receivers. In the third phase, this set of multicast trees is mapped
back to the original topology, so that the same set of senders each send to their respective receivers. This
is guaranteed to cover all of the required links that were not condensed into super-nodes, but may leave
required links that appear in cycles uncovered. The fourth and final phase uses the fast greedy heuristic to
cover any such edge.
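
The first phase can be sketched as follows. Condensing every cycle into a super-node amounts to merging the
endpoints of every link that lies on a cycle, which leaves only the bridges of the graph as tree links. The
sketch below assumes a connected, undirected graph without parallel links, and finds bridges by brute force
(remove a link, test connectivity), which is chosen for clarity rather than efficiency. The function names
are hypothetical.

```python
from collections import defaultdict


def _connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = nodes[0]
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(nodes)


def condense_cycles(nodes, edges):
    """Map each node to its super-node and return the links of the resulting tree."""
    nodes, edges = list(nodes), list(edges)
    # A link is a bridge (lies on no cycle) iff removing it disconnects the graph.
    bridges = [e for i, e in enumerate(edges)
               if not _connected(nodes, edges[:i] + edges[i + 1:])]
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    bridge_set = set(bridges)
    for a, b in edges:
        if (a, b) not in bridge_set:        # cycle link: merge its endpoints
            parent[find(a)] = find(b)
    super_node = {v: find(v) for v in nodes}
    tree_edges = [(super_node[a], super_node[b]) for a, b in bridges]
    return super_node, tree_edges


# Hypothetical example: a triangle 1-2-3 with a tail 3-4 that branches to 5 and 6.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (4, 6)]
super_node, tree_edges = condense_cycles(nodes, edges)
```

In this example the triangle collapses into one super-node, and the three remaining links form the tree on
which Tree-Optimal would then be run.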

Note that the cost of the solution obtained by Tree-Optimal on the condensed tree is a lower bound on the cost
of any solution for the actual topology. This implies several important properties of this heuristic. Call any
link that appears on a cycle in the graph a cycle link. If all of the cycle links have zero cost, and no cycle
link must be covered, then the tree heuristic is guaranteed to produce an optimal solution. Also, if there are
not many cycle links, or they all have relatively small cost, then the solution found by the heuristic will
usually be close to the lower bound, and thus close to optimal. The fact that the heuristic produces this lower
bound is also useful, as it allows one to estimate how close to optimal the solution produced by the heuristic
is.

6.1.2 Core based tree versus source based tree


PIM-SM  can  be  used  to  either  generate  a  shared  tree  anchored  at  a  rendezvous  point  (RP)  or  to  generate 
source-based  trees.  In  this  section,  we  examine  how  the  choice  of  tree  type  impacts  the  set  of  links  that 
can  be  identified,  as  well  as  the  minimum  cost  required  to  cover  a  set  of  links.  Henceforth,  we  use  the 
generic  terms  core  and  core-based  tree  in  place  of  RP  and  shared  tree.  The  core  based  tree  can  be  either 
bidirectional  where  packets  can  travel  both  up  and  down  a  tree  or  unidirectional  where  packets  must  first 
be  transmitted  to  the  core  before  they  are  then  transmitted  down  to  the  receivers.  For  simplicity,  we  only 
consider  bidirectional  trees. 

Identifiability. We examine how well core based trees can identify links relative to source based trees
as a function of the number of participants n, where n = |R|. We consider the case where R = S. We randomly
choose the n nodes from the set of 130 leaves of the vBNS topology with equal probability. Given a set of
senders/receivers,  we  form  source  based  trees  and  core  based  trees  and  count  the  numbers  of  edges  that  each 
type  of  tree  can  cover.  We  construct  two  sets  of  core  based  trees:  one  set  is  constructed  under  the  assumption 
that  all  trees  can  only  share  a  single  core,  whereas  the  other  set  is  constructed  such  that  trees  can  choose  any 
router  within  a  sub-net  of  the  vBNS  network. 

We  here  focus  on  the  strong  cover  and  plot  the  ratio  of  the  number  of  edges  covered  by  core  based  trees 
(CBT)  to  the  number  of  edges  covered  by  source  based  trees  (SBT)  in  Figure  6.  When  there  is  only  one 
core,  source  based  trees  can  cover  more  edges  than  the  core  based  trees  but  the  difference  becomes  less  as 
the  number  of  participants  increases.  When  core  based  trees  have  multiple  cores  to  choose  from,  core  based 
trees  can  cover  more  edges  than  source  based  trees  when  the  number  of  participants  is  small.  Similar  results 
are  observed  from  experiments  where  the  weak  cover  is  required. 

Figure 6: Identifiability: Core Based Trees versus Source Based Trees (Strong Cover)


Cost. Since the tree heuristic provides a lower bound on the cost of a weak cover, we use the tree heuristic
to compare the costs of covering a given set of links using core based trees and source based trees. We
assume that every leaf in the vBNS topology can be used as a source and/or a receiver for constructing
multicast trees. By varying the links to be covered, we generate ten problem instances. More precisely, we
first determine the set of edges which can be covered by both source based trees and core based trees. We
then randomly choose 30% of these edges to be covered, where each edge is equally likely to be chosen. For
each problem instance, we use the tree heuristic to compute the cost of covering the given set of links if we
are required to use source based trees, core based trees with a single core, and core based trees with multiple
cores, respectively. Figure 7 plots the costs of these different trees as well as lower bounds on the costs. For
each instance, the costs are normalized by the lower bound on the cost of source based trees. We observe that
source based trees are cheaper when only a single core is available for constructing core based trees, but
if multiple cores are available, then the cost of covering a given set of links is similar for the two routing
strategies, with the core based trees being slightly cheaper.

Figure 7: Cost: Core Based Trees versus Source Based Trees (Weak Cover)


6.1.3  Effectiveness  of  heuristics 

In Section 4 we introduced the greedy algorithm and the 0-1 integer programming algorithm for approximating
S-MMTCP and W-MMTCP, and described their worst case approximation ratio bounds. In order to
approximate the general MMTCP in polynomial time, we also proposed a fast greedy heuristic in Section
4. In this section we study the average performance of these algorithms and heuristics through experiments
on the Internet2 backbone networks. Since the topologies of these networks are close to tree topologies, we
include the performance of the tree heuristic on Internet2 backbones in our study.

To create a suite of problem instances, we varied the sizes of the source and receiver candidate sets. In
addition, for a particular pair of source candidate set and receiver candidate set, we chose the size of the set
of links that must be covered to be proportional to the size of the set of links that the source candidate set and
the receiver candidate set can identify. For each problem size, we generated 100 random problem instances
for the vBNS multicast network. For each of these problem instances, we determined the cost of the solution
found by each algorithm. We assumed that all the multicast trees have the same fixed initialization cost.

W-MMTCP and S-MMTCP. We ran the algorithms on inputs where the number of source candidates is
eight and the number of receiver candidates varies from eight to sixteen. For small problem instances such
as these, the optimal solutions can be computed using exhaustive search, and this can
be used to check the quality of the approximation results. For both W-MMTCP and S-MMTCP, we used the
0-1 integer programming, greedy and fast greedy algorithms to approximate the 100 problem instances on
vBNS for each of the problem sizes. In addition, we used the tree heuristic to approximate the W-MMTCP.
We compare the performance of the algorithms for S-MMTCP in Figure 8 and W-MMTCP in Figure 9.
In both figures, the ratio of the cost of the solution found by the approximation algorithm to the cost of the
optimal solution is plotted. For each approximation algorithm, we sort the ratios in ascending order. Thus, for
example, problem instance 1 for each algorithm represents the instance where that algorithm performed the
closest to optimal, and may correspond to different inputs for the different algorithms. We present plots for two
different problem sizes: 8 sources and 8 receivers, as well as 8 sources and 16 receivers.
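
The way the plotted curves are built can be illustrated with a small sketch. The cost values below are made
up for illustration and are not results from the experiments; the function name sorted_ratios is likewise
hypothetical.

```python
def sorted_ratios(approx_costs, optimal_costs):
    """Per-instance ratio of approximate cost to optimal cost, in ascending order."""
    return sorted(a / o for a, o in zip(approx_costs, optimal_costs))


# Hypothetical costs on three problem instances (same inputs for both algorithms).
optimal = [10.0, 20.0, 40.0]
greedy = [12.0, 20.0, 60.0]
tree_heuristic = [10.0, 30.0, 40.0]

greedy_curve = sorted_ratios(greedy, optimal)
tree_curve = sorted_ratios(tree_heuristic, optimal)
```

Here the worst point on the greedy curve comes from the third input, while the worst point on the tree
heuristic curve comes from the second input, so the same position on the horizontal axis need not refer to
the same input for different algorithms.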

In Figure 8, it is surprising to see that the fast greedy algorithm produces the same solution as the greedy
algorithm and that the 0-1 integer programming algorithm yields the optimal solution on most inputs when
the problem size is small. As the problem size increases, the 0-1 integer programming algorithm is less likely
to produce the optimal solution and the difference between the fast greedy and the optimal seems to increase
slowly.

In the case of approximating W-MMTCP, the tree heuristic outperforms both the greedy algorithm and
the fast greedy algorithm on most problem instances. The quality of the 0-1 integer programming algorithm
decreases as the problem size increases. The results from the fast greedy algorithm are only slightly worse
than those from the greedy algorithm. The difference between the fast greedy algorithm and the greedy
algorithm seems to change very slowly as the problem size increases.

M-MMTCP. The complexity of the exhaustive search algorithm to compute the optimal solution for the
M-MMTCP is high. Thus, we can only compute the optimal solution for very small problem sizes. Instead of
showing the ratio of the solution returned by our heuristics to the optimal solution, we focus on comparing
the results from the greedy algorithm and the fast greedy heuristic. In Figure 10, we plot the ratio of
solutions from the fast greedy heuristic and the greedy algorithm. We sort the ratios in ascending order. The
difference in the quality of solution between the greedy algorithm and the fast greedy heuristic seems to
increase slowly as the problem size increases. The fast greedy heuristic can produce better results than the
greedy algorithm on a small number of instances.


Figure 8: Comparison of approximation algorithms for the S-MMTCP on vBNS. (Panels: 8 sources and 8
receivers; 8 sources and 16 receivers.)

Figure 9: Comparison of approximation algorithms for the W-MMTCP on vBNS. (Panels: 8 sources and 8
receivers; 8 sources and 16 receivers.)

Figure 10: Comparison of approximation algorithms for the M-MMTCP on vBNS. (Panels: 8 sources and 8
receivers; 8 sources and 16 receivers.)

6.2  Experiments  on  dense  networks 

Both vBNS and Abilene are quite sparse, i.e., each network contains only a few more edges than a tree
topology with the same number of nodes. In this section, we investigate how our
algorithms perform on denser networks than vBNS and Abilene. Unfortunately, we had no such multicast
topologies available to us. Instead, we make use of randomly generated topologies.

We generated ten 100-node transit-stub undirected graphs using GT-ITM (GT internetwork topology
model). For more details about the transit-stub network model, please refer to [14]. The average out-degree
is in the range [2.5, 5]. We assigned two costs to each edge in the graphs, one for each direction. These
costs are uniformly distributed in [2, 20]. By randomly picking the source candidate set, receiver candidate
set and to-be-covered set of links and then assigning costs to the edges, we generated ten problem instances
for each graph. We ran all algorithms on a total of 100 problem instances and compared their performance.
We assumed that all the multicast trees have the same fixed initialization cost.
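
The instance generation described above can be sketched as follows. This is not the generator used in the
experiments: the graph itself would come from GT-ITM, which is not reproduced here, and the sketch makes
several assumptions that the text does not state, namely that costs are drawn from a continuous uniform
distribution, that candidates are drawn from a caller-supplied node list, and that the to-be-covered links
are drawn without checking identifiability. All names are hypothetical.

```python
import random


def make_instance(edges, candidates, n_sources, n_receivers, n_required, seed=0):
    """Build one random problem instance on a given undirected graph."""
    rng = random.Random(seed)               # seeded for reproducibility
    cost = {}
    for a, b in edges:
        cost[(a, b)] = rng.uniform(2, 20)   # one cost per direction
        cost[(b, a)] = rng.uniform(2, 20)
    sources = rng.sample(candidates, n_sources)
    receivers = rng.sample(candidates, n_receivers)
    required = rng.sample(edges, n_required)
    return cost, sources, receivers, required


# Hypothetical five-node graph standing in for a generated topology.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 4)]
cost, sources, receivers, required = make_instance(
    edges, candidates=[0, 2, 3, 4], n_sources=2, n_receivers=3, n_required=2)
```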


Figure 11: Comparison of approximation algorithms for the S-MMTCP on 100-node transit-stub graphs.
(Panels: 8 sources and 8 receivers; 8 sources and 16 receivers.)

Figure 12: Comparison of approximation algorithms for the W-MMTCP on 100-node transit-stub graphs.
(Panels: 8 sources and 8 receivers; 8 sources and 16 receivers.)

Figure 13: Comparison of approximation algorithms for the M-MMTCP on 100-node transit-stub graphs.
(Panels: 8 sources and 8 receivers; 8 sources and 16 receivers. Horizontal axis: problem instances.)


6.2.1  S-MMTCP  and  W-MMTCP 

We ran the algorithms on inputs where the number of source candidates is eight and the number of receiver
candidates varies from eight to sixteen. We used the 0-1 integer programming, greedy and fast greedy
algorithms to approximate the 100 problem instances for each of the problem sizes. We compare the
performance of the algorithms for the S-MMTCP in Figure 11 and the W-MMTCP in Figure 12. In both figures,
the ratio of the solution found by the approximation algorithms to the optimal solution is plotted. For each
approximation algorithm, we sorted the ratios in ascending order.

In Figure 11, it is surprising to see that the fast greedy algorithm produces the same solution as the greedy
algorithm and that the 0-1 integer programming algorithm yields the optimal solution on most inputs. As the
problem size increases, the greedy and fast greedy algorithms are less likely to produce the optimal solution.

In  the  case  of  approximating  W-MMTCP,  the  0-1  integer  programming  algorithm  yields  the  optimal 
solution  on  about  half  the  problem  instances.  However,  it  yields  worse  results  than  the  greedy  and  fast 
greedy algorithm for about forty percent of the problem instances. The results from the fast greedy algorithm
are slightly worse than those from the greedy algorithm in most cases. The difference between the fast
greedy algorithm and the greedy algorithm seems to increase very slowly as the problem size increases.


6.2.2  M-MMTCP 

We also ran the algorithms for the M-MMTCP on inputs where the number of source candidates is eight and
the number of receiver candidates varies from eight to sixteen. We compare the results from the greedy
algorithm and the fast greedy heuristic. In Figure 13, we plot the ratio of solutions from the fast greedy heuristic
and the greedy algorithm. We sort the ratios in ascending order. The fast greedy heuristic yields the same
results as the greedy algorithm on more than sixty percent of the problem instances. In approximately five
percent of the problem instances, the fast greedy heuristic yields better solutions than the greedy algorithm.
The difference in the quality of solution between the greedy algorithm and the fast greedy heuristic seems
to increase slowly as the problem size increases.

7  Conclusions 

In this paper we focused on the problem of selecting trees from a candidate set in order to cover a set
of links of interest. We identified three variations of this problem according to the definition of cover and
addressed two questions for each of them:

•  is  it  possible  to  cover  the  links  of  interest  using  trees  from  the  candidate  set? 

•  if  the  answer  to  the  first  question  is  yes,  what  is  the  minimum  set  of  trees  that  can  cover  the  links? 

We  proposed  computationally  efficient  algorithms  for  the  first  of  these  questions.  We  also  established,  with 
some  exceptions,  that  determining  the  minimum  cost  set  of  trees  is  a  hard  problem.  Moreover,  it  is  a  hard 
problem  even  to  develop  approximate  solutions.  One  exception  is  when  the  underlying  topology  is  a  tree  in 
which  case  we  present  efficient  dynamic  programming  algorithms  for  two  of  the  covers.  We  also  proposed 
several  heuristics  and  showed  through  simulation  that  a  greedy  heuristic  that  combines  trees  with  three  or 
fewer  receivers  performs  reasonably  well. 

Last,  we  applied  our  methods  to  the  vBNS  and  Abilene  networks  and  observed 

•  a  heuristic  based  on  the  tree  algorithm  gives  excellent  results, 

•  the  cost  of  the  set  of  trees  required  to  cover  a  set  of  links  is  relatively  insensitive  to  the  use  of  either  a 
core  based  tree  or  a  source  based  tree  algorithm. 

A  Details  of  the  algorithm  Tree-Optimal

We  first  provide  the  proof  of  Theorem  5,  restated  here  for  convenience. 

Theorem 5 The algorithm Tree-Optimal finds the optimal cost of a solution to the W-MMTCP in any binary
tree in O(|L||S|^6) steps.

Proof: Each entry in the table in row l can be computed by examining |S|^4 pairs of entries in two other
rows m and n, and performing a constant number of steps per pair of entries examined. The bound on
the running time follows from the fact that there are |L|(|S| + 1)^2 table entries. In order to prove that the
algorithm returns the correct answer, we show that for all l, i, j, including l = x, i = j = 0, the correct
value of C<l,i,j> is computed. We show this is true for the row corresponding to any link l by a reverse
induction on the distance that l is from the link x. The base cases are the values computed for the links
attached to leaves of the tree N, which can be seen trivially to be correct.

For the inductive step, assume that we want to compute the row associated with the link l, as depicted in
Figure 4. By the inductive hypothesis, we can assume that the rows corresponding to the links m and n are
correct. When determining the correct value for entry C<m,i_m,j_m>, it was assumed that flows going over the
link m originated or ended at the node v. However, the cost of these flows going over links in the subtree
ST_m is not affected by whether or not these flows also pass over link l and/or links in the subtree ST_n.
Therefore, for any values of i_m and j_m, the optimal cost of all flows going over links in the subtree ST_m in
the best solution for the tree ST_l that sends i_m flows from v across m and receives j_m flows to v across m is
C<m,i_m,j_m>. Similarly, the optimal cost of all flows going over links in the subtree ST_n in the best solution
for the tree ST_l that sends i_n flows from v across n and receives j_n flows to v across n is C<n,i_n,j_n>. Since
the algorithm Tree-Optimal tries all possible values of flows across links m and n, the entry C<l,i,j> is
correct, which completes the induction. ■

We also provide some additional details of the algorithm Tree-Optimal. In particular, we describe how
to determine, for a set of three links all incident to the same node, whether or not a given specification
of flows along each of the links leads to a valid multicast flow. We assume that we are given 6 values
l_in, l_out, m_in, m_out, n_in, and n_out that describe the number of flows in to and out of a node v along the
three links l, m, and n, all incident to node v. We also assume that we are given a description of the node v
that allows us to determine whether or not v is a sender and/or a receiver.

We  here  provide  an  algorithm  that  requires  determining  if  an  integer  programming  problem  has  a  feasible 
solution.  Standard  techniques  can  solve  an  integer  programming  problem  in  time  that  is  exponential  in  the 
number  of  constraints.  The  number  of  constraints  in  the  algorithm  is  constant,  and  thus  we  can  determine  if 
the  values  of  the  flows  lead  to  a  valid  set  of  multicast  flows  through  the  node  v  in  constant  time.  We  point 
out  that  if  this  integer  program  were  used  as  a  subroutine  for  the  algorithm  Tree-Optimal,  this  would  lead 
to  large  constants  hidden  in  the  asymptotic  notation.  However,  a  more  complicated  technique  can  remove 
these  large  constants. 

Function: valid(l_in, l_out, m_in, m_out, n_in, n_out, v)

1. Set S_v to 1 if v is a source, otherwise set S_v to 0.

2. Let N_xy denote a possible number of flows which first traverse link x and
then link y. Define the following constraints:

Set I

N_lm + N_ln ≥ l_in
N_ml + N_mn ≥ m_in
N_nl + N_nm ≥ n_in

Set II

N_lm + N_nm ≤ m_out ≤ N_lm + N_nm + S_v
N_ln + N_mn ≤ n_out ≤ N_ln + N_mn + S_v
N_ml + N_nl ≤ l_out ≤ N_ml + N_nl + S_v

0 ≤ N_lm ≤ l_in
0 ≤ N_ln ≤ l_in
0 ≤ N_mn ≤ m_in
0 ≤ N_ml ≤ m_in
0 ≤ N_nl ≤ n_in
0 ≤ N_nm ≤ n_in

3. If v is not a receiver, return true if there is a feasible integer solution
for constraint sets I and II.

4. If v is a receiver, return true if there is a feasible integer solution for
constraint set II.

Figure 14: How to check if the number of flows into and out of a node is valid
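
For small flow counts, the feasibility test of Figure 14 can be illustrated by brute-force enumeration in
place of an integer programming solver. The sketch below is not the authors' implementation. It assumes
that constraint set I requires every flow entering v on a link to be forwarded on at least one of the other
two links, which is an interpretation rather than something the figure states explicitly, and it replaces
the node description v by two hypothetical boolean arguments.

```python
from itertools import product

LINKS = "lmn"


def valid(l_in, l_out, m_in, m_out, n_in, n_out, is_source, is_receiver):
    """Brute-force check for a feasible integer assignment of the N_xy values."""
    s_v = 1 if is_source else 0
    inflow = {"l": l_in, "m": m_in, "n": n_in}
    outflow = {"l": l_out, "m": m_out, "n": n_out}
    pairs = [(x, y) for x in LINKS for y in LINKS if x != y]
    # Bounds 0 <= N_xy <= x_in are enforced by the enumeration ranges.
    ranges = [range(inflow[x] + 1) for x, _ in pairs]
    for values in product(*ranges):
        n = dict(zip(pairs, values))
        # Set II: flows leaving on y are those forwarded to y, plus at most
        # one flow originated at v when v is a source.
        into = {y: sum(n[(x, y)] for x in LINKS if x != y) for y in LINKS}
        if not all(into[y] <= outflow[y] <= into[y] + s_v for y in LINKS):
            continue
        # Set I (assumed form): a non-receiver must forward every incoming flow.
        if not is_receiver:
            out_of = {x: sum(n[(x, y)] for y in LINKS if y != x) for x in LINKS}
            if not all(out_of[x] >= inflow[x] for x in LINKS):
                continue
        return True
    return False


# One flow arrives on l and is duplicated onto m and n at a pure relay node.
relay_ok = valid(1, 0, 0, 1, 0, 1, is_source=False, is_receiver=False)
# One flow arrives on l and stops: allowed only if v is a receiver.
dead_end = valid(1, 0, 0, 0, 0, 0, is_source=False, is_receiver=False)
terminates = valid(1, 0, 0, 0, 0, 0, is_source=False, is_receiver=True)
```

The enumeration is exponential in the flow counts, so it only serves to make the constraints concrete.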

Abstract

We present MBone experiments that validate an end-to-end measurement technique we call MINC,
for Multicast Inference of Network Characteristics. MINC exploits the performance correlation
experienced by multicast receivers to infer loss rates and other attributes of internal links in a
multicast tree. MINC has two important advantages in the Internet context: it does not rely on network
collaboration and it scales to very large measurements. In previous work, we laid the foundation for MINC
using rigorous statistical analysis and packet-level simulation. Here, we further validate MINC by
comparing the loss rates on internal MBone tunnels as inferred using our technique and as measured using
the mtrace tool. Inferred values closely matched directly measured values: differences were usually
well below 1%, never above 3%, while loss rates varied between 0 and 35%.

1  Introduction 

As the Internet grows in size and diversity, its internal performance becomes harder to measure. Any
one organization has administrative access to only a small fraction of the network’s internal nodes, while
commercial factors often prevent organizations from sharing internal performance data. End-to-end
measurements using unicast traffic do not rely on administrative privileges, but it is difficult to infer
link-level performance from them and they require large amounts of traffic to cover multiple paths. There is
a need for practical and efficient procedures that can take an internal snapshot of a significant portion of
the network.

*This work was sponsored in part by DARPA and the Air Force Research Laboratory under agreement
F30602-98-2-0238.

We have developed a measurement technique that addresses these problems. Multicast Inference of
Network Characteristics (MINC) [11] uses end-to-end multicast traffic as measurement probes. It
exploits the inherent correlation in performance observed by multicast receivers to infer the loss rate and
other attributes of paths between branch points in a multicast routing tree. These measurements do not
rely on administrative access to internal nodes since they are done between end hosts. In addition, they
scale to large networks because of the bandwidth efficiency of multicast traffic.

The intuition behind packet loss inference is that the event that a packet has reached a given internal
node in the tree can be inferred from the packet’s arrival at one or more receivers descended from that
node. Conditioning on this event, we can determine the probability of successful transmission to and
beyond the given node. Consider, for example, a simple multicast tree with a root node (the source), two leaf
nodes (the left and right receivers), a link from the source to a branch point (the shared link), and a link
from the branch point to each of the receivers (the left and right links). The source sends a stream of
sequenced multicast packets through the tree to the two receivers. If a packet reaches either receiver, we can
infer that the packet reached the branch point. Thus the ratio of the number of packets that reach both
receivers to the total number that reach the right receiver gives an estimate of the probability of
successful transmission on the left link. The probability of successful transmission on the other links can be
found by similar reasoning.
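
The two-receiver reasoning can be written out as a short sketch. The denominator for the left link counts
every probe that reached the right receiver, whether or not it also reached the left one. The shared-link
formula is derived here under the assumption that losses are independent across links, and the function name
and the probe traces are hypothetical; this is the simple binary-tree case only, not the general estimator.

```python
def estimate_link_success(left_received, right_received):
    """Estimate success probabilities of the shared, left, and right links.

    left_received, right_received: equal-length sequences of 0/1 flags,
    one per probe, telling whether that probe reached the receiver.
    """
    n = len(left_received)
    n_left = sum(left_received)
    n_right = sum(right_received)
    n_both = sum(1 for l, r in zip(left_received, right_received) if l and r)
    if n_both == 0:
        raise ValueError("no probe reached both receivers; estimate undefined")
    left_link = n_both / n_right                     # P(left ok | reached branch point)
    right_link = n_both / n_left                     # symmetric argument
    shared_link = (n_left * n_right) / (n_both * n)  # P(L) * P(R) / P(both)
    return shared_link, left_link, right_link


# Hypothetical trace of 10 probes: 6 reach both, 2 left only, 1 right only, 1 neither.
left = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
right = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
shared, left_link, right_link = estimate_link_success(left, right)
```

With these counts the left link estimate is 6/7, the right link estimate is 6/8, and the shared link
estimate is (8 * 7) / (6 * 10) = 14/15.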

It is not immediately clear whether this technique applies to more than just binary trees or whether
it enjoys desirable statistical properties. In previous work [2], we extended this technique to general
trees and showed that the estimate is consistent, that is, it converges to the true loss rates as the number
of probes grows. More specifically, we developed a Maximum Likelihood Estimator (MLE) for
internal loss rates in a general tree assuming independent losses across links and across probes. We derived
the MLE’s rate of convergence and established its robustness with respect to certain violations of the
independence assumption. We also validated these analytical results using the ns simulator [15]. We
give a brief account of these results in Section 2.2.

In more recent work [3], we explored the accuracy of our packet loss estimation under a variety of
network conditions. Again using ns simulations, we evaluated the error between inferred and actual loss
rates as we varied the network topology, propagation delay, packet drop policy, background traffic mix,
and probe traffic type. We found that, in all cases, MINC accurately inferred the per-link loss rates of
multicast probe traffic.

In this paper, we further validate MINC through experiments under real network conditions. We used
a collection of end hosts connected to the MBone, the multicast-capable subset of the Internet [10]. We
chose one host as the source of multicast probes and used the rest as receivers. We then made two types of
measurements simultaneously: end-to-end loss measurements between the source and each receiver, and
direct loss measurements at every internal node of the multicast tree. Finally, we ran our inference
algorithm on the results of the end-to-end measurements, and compared the inferred loss rates to the
directly measured loss rates. Across all our experiments, the inferred values closely matched the
directly measured values. The differences between the two were usually well below 1%, never above 3%,
while loss rates varied between 0 and 35%. Furthermore, the inference algorithm converged well within
2-minute, 1200-probe measurement intervals.

The  rest  of  this  paper  is  organized  as  follows:  Sec¬ 
tion  2  describes  our  experimental  methodology;  Sec¬ 
tion  3  presents  our  experimental  results;  Section  4 
discusses  our  ongoing  work;  Section  5  surveys  re¬ 
lated  work;  and  Section  6  offers  some  conclusions. 

2  Experimental  Methodology 

During  each  of  our  MBone  experiments,  we  had  a 
source  send  a  stream  of  sequenced  packets  to  a  col¬ 
lection  of  receivers  while  we  made  two  types  of 
measurement  at  each  receiver.  At  the  source,  we 
used  our  mgen  traffic  generation  tool  to  send  one 
40-byte  packet  every  100  milliseconds  to  a  specific 
multicast  group.  The  resulting  traffic  stream  placed 
less  than  4  Kbps  of  load  on  any  one  MBone  link. 
We reserved multicast address 224.2.130.64 and port
22778  for  our  experiments  using  the  sdr  session 
directory  tool  [21].  At  each  receiver,  we  ran  the 
mtrace  [13]  and  mbat  [9]  tools  to  gather  statistics 
about  traffic  on  this  multicast  group.  Below  we  de¬ 
scribe  our  use  of  mtrace  and  mbat  in  more  detail. 

2.1  Direct  measurements 

mtrace  traces  the  reverse  path  from  a  multicast 
source  to  a  receiver.  It  runs  at  the  receiver  and  is¬ 
sues  trace  queries  that  travel  hop-by-hop  up  the  mul¬ 
ticast  tree  towards  the  source.  Each  router  along 
the  path  responds  to  these  queries  with  information 
about  traffic  on  the  specified  multicast  group  as  seen 
by that router, including counts of incoming and
outgoing packets. mtrace calculates packet losses on
a  link  by  comparing  the  packet  counts  returned  by 
the  two  routers  at  either  end  of  the  link. 
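As an illustration of this counter comparison, the following is a minimal Python sketch. It is not the mtrace implementation: the function name and the four counter arguments are hypothetical, and it assumes each router's packet count for the group is sampled once at the start and once at the end of a measurement interval.

```python
def link_loss_rate(up_start, up_end, down_start, down_end):
    """Loss rate on one link over an interval, from packet counters.

    up_*   : counter at the router on the upstream end of the link
    down_* : counter at the router on the downstream end of the link
    All names are illustrative; real mtrace output is not parsed here.
    """
    sent = up_end - up_start          # packets forwarded onto the link
    received = down_end - down_start  # packets seen at the far end
    if sent <= 0:
        return None                   # no traffic observed: rate undefined
    # Counters are not sampled at the same instant, so a small negative
    # difference is possible; clamp it to zero.
    return max(0.0, (sent - received) / sent)


rate = link_loss_rate(1000, 2200, 900, 2040)  # 1200 sent, 1140 received
```

With 1200 packets counted upstream and 1140 downstream, the sketch reports a loss rate of 5% for the link.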

In  each  of  our  experiments,  we  collected  mtrace 
statistics  for  consecutive  two-minute  intervals  over 
the  course  of  one  hour.  We  ran  a  separate  instance  of 
mtrace  for  each  interval.  Each  mtrace  run  issued 
a  trace  query  at  the  beginning  of  the  interval  and  an¬ 
other  query  at  the  end.  We  thus  measured  link-level 
loss  rates  for  all  thirty  intervals  in  one  hour  as  shown 
in Figures 2-4. These intervals are not exactly two
minutes long due to delays incurred in collecting
responses to the queries. We recorded timestamps for
the actual beginning and end of each mtrace run to
help synchronize our inference calculations to these
direct measurements.

Physical location                                            Abbreviation
AT&T Labs - Research, Florham Park, New Jersey               AT&T
Carnegie Mellon University, Pittsburgh, Pennsylvania         CMU
Georgia Institute of Technology, Atlanta, Georgia            GaTech
University of California, Berkeley, California               UCB
University of Kentucky, Lexington, Kentucky                  UKy
University of Massachusetts, Amherst, Massachusetts          UMass
University of Southern California, Los Angeles, California   USC
University of Washington, Seattle, Washington                UWash

Table 1: End hosts used during our MBone experiments.

Physical location             Abbreviation
Atlanta, Georgia              GA
Cambridge, Massachusetts      MA
San Francisco, California     CA
West Orange, New Jersey       NJ

Table 2: Routers at multicast branch points during our representative MBone experiment.

We  chose  to  measure  two-minute  intervals  based 
on  our  previous  experience  with  MINC.  Our  simu¬ 
lations  have  shown  that  the  statistical  inference  al¬ 
gorithm  at  the  heart  of  MINC  converges  to  true  loss 
rates after roughly 1,000 observations [2]. Given the
100  milliseconds  between  probes  in  our  MBone  ex¬ 
periments,  two  minutes  allow  for  1,200  probes  be¬ 
tween  measurements.  As  shown  in  Figure  5,  1,200 
probes  were  indeed  enough  for  MINC  to  converge. 
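The arithmetic behind these figures can be checked with a short Python sketch. The function names are illustrative, and the load figure counts only the 40 bytes stated for each packet.

```python
def probe_load_kbps(packet_bytes, interval_ms):
    """Offered load of a constant-rate probe stream, in kilobits per second."""
    packets_per_second = 1000 / interval_ms
    return packet_bytes * 8 * packets_per_second / 1000


def probes_per_interval(interval_seconds, probe_spacing_ms):
    """Number of probes sent during one measurement interval."""
    return int(interval_seconds * 1000 / probe_spacing_ms)


load = probe_load_kbps(40, 100)         # 40-byte packet every 100 ms
count = probes_per_interval(120, 100)   # two-minute interval
```

Under these assumptions the stream offers 3.2 Kbps, consistent with the stated bound of less than 4 Kbps, and a two-minute interval contains 1,200 probes.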

It  is  important  to  note  that  mtrace  does  not  scale 
to  measurements  of  large  multicast  groups  if  used  in 
parallel  from  all  receivers  as  we  describe  here.  Par¬ 
allel  mtrace  queries  come  together  as  they  travel 
up  the  tree.  Enough  such  queries  will  overload 
routers  and  links  with  measurement  traffic.  We  used 
mtrace  in  this  way  only  to  validate  MINC  on  rel¬ 
atively  small  multicast  groups  before  we  move  on  to 
use  MINC  alone  on  larger  groups. 


2.2  Statistical  inference 

MINC  works  on  logical  multicast  trees.  A  logical 
tree  is  one  where  all  nodes,  except  the  root  and  the 
leaves,  have  at  least  two  children.  A  physical  tree  can 
be  converted  into  a  logical  tree  by  deleting  all  nodes, 
other  than  the  root,  that  have  only  one  child  and  then 
collapsing  the  links  accordingly.  A  link  in  a  logical 
tree  may  thus  represent  multiple  physical  links.  This 
conversion  is  necessary  because  inference  based  on 
correlation  among  receivers  cannot  distinguish  be¬ 
tween  two  physical  links  unless  these  links  lead  to 
two  different  receivers.  Henceforth  when  we  speak 
of  trees  we  will  be  speaking  of  logical  trees. 
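The conversion from a physical to a logical tree can be sketched in Python as follows. This is an illustration under assumed data structures, not the code used in the experiments: the tree is a hypothetical dictionary mapping each node to its list of children, and the node names in the example are invented.

```python
def to_logical_tree(children, root):
    """Collapse every non-root node that has exactly one child.

    Returns (logical, merged): logical maps each surviving node to its
    children; merged[k] lists the physical nodes folded into the logical
    link that terminates at k.
    """
    def descend(node):
        # Follow a chain of single-child nodes to the next branch or leaf.
        skipped = []
        while len(children.get(node, [])) == 1:
            skipped.append(node)
            node = children[node][0]
        return node, skipped

    logical, merged = {}, {}
    stack = [root]                     # the root is kept even with one child
    while stack:
        node = stack.pop()
        logical[node] = []
        for child in children.get(node, []):
            end, skipped = descend(child)
            logical[node].append(end)
            merged[end] = skipped
            stack.append(end)
    return logical, merged


physical = {"S": ["r1"], "r1": ["r2"], "r2": ["A", "r3"], "r3": ["B"]}
logical, merged = to_logical_tree(physical, "S")
```

In the example, routers r1 and r3 each have a single child and disappear, so the logical link into r2 stands for the two physical links S-r1 and r1-r2.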

2.2.1  Inference  algorithm 

Our model for loss on a multicast tree assumes that
packet loss is independent across different links of
the tree, and independent between different probes.
With these assumptions, the loss model is specified
by associating a probability $\alpha_k$ with each node $k$
in the tree. $\alpha_k$ is the probability that a packet is
transmitted successfully across the link terminating
at node $k$, given that it reaches the parent node $p(k)$
of $k$.

Figure 1: Multicast routing tree during our representative MBone experiment.
[Tree diagram not recoverable from the extracted text. Node labels, top to bottom: UKy; GA; AT&T, UMass, NJ, USC, UCB, UWash; CMU, GaTech.]


When a probe is transmitted from the source, we
can record the outcome as the set of receivers the
probe reached. The loss inference algorithm is based
on probabilistic analysis that allows us to express the
$\alpha_k$ directly in terms of the expected frequencies of
such outcomes. More precisely, for each node $k$ let
$\gamma_k$ denote the probability of the outcome that a given
packet reaches at least one receiver that has $k$ as an
ancestor in the tree. Let $A_k$ denote the probability
that a given packet reaches the node $k$, i.e.,
$A_k = \alpha_k \alpha_{k_1} \alpha_{k_2} \cdots \alpha_{k_m}$ where $k_1, k_2, \ldots, k_m$ is the chain
of $m$ adjacent nodes leading back from node $k$ to
the root of the tree. Then it can be shown that $A_k$
satisfies

$$1 - \gamma_k / A_k = \prod_{j \in c(k)} \left(1 - \gamma_j / A_k\right) \tag{1}$$

where the product is taken over all nodes $j$ in $c(k)$,
the set of children of the node $k$. It was shown in [2]
that under generic conditions the $A_k$ can be recovered
uniquely through (1) if the $\gamma_k$ are known. The $\alpha_k$
can in turn be recovered since $\alpha_k = A_k / A_{p(k)}$.
Generally, finding $A_k$ requires numerical root-finding for
(1). In the special case of a node $k$ with two offspring
$i$ and $j$, (1) can be solved explicitly:

$$A_k = \frac{\gamma_i \gamma_j}{\gamma_i + \gamma_j - \gamma_k} \tag{2}$$
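To make the root-finding step concrete, here is a minimal Python sketch that solves (1) by bisection and compares the result with the closed form (2). The function names, the choice of bisection, and the bracketing strategy are assumptions made for illustration; the source does not say which numerical method was used.

```python
from math import prod


def solve_A(gamma_k, child_gammas, tol=1e-12):
    """Solve (1 - gamma_k/A) = prod_j (1 - gamma_j/A) for A > gamma_k."""
    def h(a):
        return (1 - gamma_k / a) - prod(1 - g / a for g in child_gammas)

    lo, hi = gamma_k, 1.0
    # h is <= 0 at A = gamma_k; widen the bracket until h(hi) > 0, since
    # sampling noise can push the empirical root slightly above 1.
    for _ in range(60):
        if h(hi) > 0:
            break
        hi *= 2
    else:
        raise ValueError("no root: generic conditions for (1) may fail")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def solve_A_two_children(gamma_k, gamma_i, gamma_j):
    """Closed form (2) for a node with exactly two children."""
    return gamma_i * gamma_j / (gamma_i + gamma_j - gamma_k)


# Node k reached with probability 0.9; child links succeed with 0.8 and 0.7.
numeric = solve_A(0.846, [0.72, 0.63])
closed = solve_A_two_children(0.846, 0.72, 0.63)
```

The example values follow from the model: $\gamma_i = 0.9 \times 0.8$, $\gamma_j = 0.9 \times 0.7$, and $\gamma_k = 0.9 \times (1 - 0.2 \times 0.3)$. Both routes recover $A_k = 0.9$ up to numerical tolerance.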


Suppose that in place of the $\gamma_k$ in (1), we use the
actual frequencies $\hat{\gamma}_k$ with which $n$ probes reach at
least one receiver with ancestor $k$. We denote the
corresponding solutions to (1) by $\hat{A}_k$ and estimate
the link probabilities by $\hat{\alpha}_k = \hat{A}_k / \hat{A}_{p(k)}$.
The calculation of the $\hat{\gamma}_k$ is achieved through a simple recursion
as follows. Define new variables $Y_k(i)$ as functions of
the measured outcomes of $n$ probes by

$$Y_k(i) = \begin{cases} 1 & \text{if probe } i \text{ reaches node } k \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

if $k$ is a leaf node, and

$$Y_k(i) = \max_{j \in c(k)} Y_j(i) \tag{4}$$

otherwise. Then

$$\hat{\gamma}_k = \frac{1}{n} \sum_{i=1}^{n} Y_k(i) \tag{5}$$
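The recursion in (3) and (4) and the final averaging step can be sketched in Python as follows. The data layout is an assumption for illustration: the tree is a hypothetical dictionary of child lists, and each receiver's record is a list of 0/1 flags, one per probe. The second function applies the closed form (2) to the simplest case, a branch node directly below the source with two leaf children, and is not a general estimator.

```python
def gamma_hat(children, outcomes, root):
    """Empirical gamma_k for every node, via Y_k(i) = max over children."""
    n = len(next(iter(outcomes.values())))   # number of probes sent
    reached = {}                             # reached[k][i] stands for Y_k(i)

    def visit(k):
        kids = children.get(k, [])
        if not kids:                         # leaf: the receiver's own record
            reached[k] = list(outcomes[k])
        else:
            for j in kids:
                visit(j)
            reached[k] = [max(reached[j][i] for j in kids) for i in range(n)]

    visit(root)
    return {k: sum(y) / n for k, y in reached.items()}


def two_leaf_estimates(g, k, i, j):
    """Link success estimates when k's parent is the source (A = 1)."""
    A_k = g[i] * g[j] / (g[i] + g[j] - g[k])     # closed form (2)
    return {k: A_k, i: g[i] / A_k, j: g[j] / A_k}  # for a leaf, A equals gamma


children = {"src": ["k"], "k": ["r1", "r2"]}
outcomes = {
    "r1": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    "r2": [1, 1, 0, 0, 1, 1, 1, 1, 0, 0],
}
g = gamma_hat(children, outcomes, "src")
alpha = two_leaf_estimates(g, "k", "r1", "r2")
```

In this ten-probe example the receivers see 7 and 6 probes and at least one of them sees 8, so the estimated success probability on the shared link is 0.7 x 0.6 / (0.7 + 0.6 - 0.8) = 0.84.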

We showed in [2] that the estimator $\hat{\alpha}_k$ enjoys two
useful properties: (i) consistency: $\hat{\alpha}_k$ converges to
the true value $\alpha_k$ almost surely as the number of
probes $n$ grows to infinity, and (ii) asymptotic
normality: the distribution of the normalized difference
$\sqrt{n}(\hat{\alpha}_k - \alpha_k)$ converges to a normal distribution as
$n$ grows to infinity. We also investigated in [2] the
effects of correlations that violate the independent
loss assumptions. Consistency is preserved under a
large class of temporal correlations, although
convergence of the estimates with $n$ can be slower.
Spatial correlations perturb the estimate continuously, in
that small correlations lead to small inconsistencies.
When losses on sibling links are correlated, the
perturbation is a second-order effect, in that the degree
of inconsistency depends not on the size of the
correlations, but on the degree to which they change
across the tree.

Figure 2: Loss rates on link to GA when running mtrace from AT&T. (a) Inferred vs. directly measured loss rates. (b) Difference between the two, abs(inferred - mtrace). The two sets of loss rates agreed
closely over a wide range of values. Differences remained below 1.5% while loss rates varied between 4
and 30%.

Our  earlier  papers  on  MINC  [2,  3]  contain  a  de¬ 
tailed  description  and  analysis  of  the  above  inference 
algorithm,  including  rules  to  handle  special  cases  of 
the  data  in  which  the  generic  conditions  required  for 
the  existence  of  solutions  to  (1)  fail.  In  the  interests 
of  brevity,  we  omit  these  details  from  this  paper. 

2.2.2  Inference  calculations 

We encoded our loss inference algorithm in a
program called infer. infer takes two inputs: a
description of the tree topology and a description of the
end-to-end  losses  experienced  by  each  receiver.  It 
produces  as  output  the  estimated  loss  rates  on  every 
link  in  the  tree. 

We  determined  the  tree  topology  by  combining  the 
mtrace  output  from  all  the  receivers.  Along  with 
packet counts, mtrace reports the domain name
and IP address of each router on the path from the
source  to  a  receiver.  We  built  a  complete  multi¬ 
cast  tree  by  looking  for  common  routers  and  branch 
points  on  the  paths  to  all  the  receivers.  The  topology 
of  the  MBone  is  relatively  static  due  to  that  network’s 
current  reliance  on  manually  configured  IP-over-IP 
tunnels.  These  tunnels  are  themselves  logical  links 
that  may  each  contain  multiple  physical  links.  We 
verified that the topology remained constant during
our experiments by inspecting the path information
we obtained every two minutes from mtrace.
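The merging of per-receiver paths into one tree can be illustrated with a small Python sketch. It assumes each path has already been extracted as a list of node identifiers ordered from the source to the receiver; parsing of actual mtrace output is not shown, and the node names below are hypothetical.

```python
def build_tree(paths):
    """Merge source-to-receiver paths into a map from node to children.

    Routers that appear on several paths become shared nodes; a node with
    more than one child is a branch point of the multicast tree.
    """
    children = {}
    for path in paths:
        for parent, child in zip(path, path[1:]):
            kids = children.setdefault(parent, [])
            if child not in kids:
                kids.append(child)
        children.setdefault(path[-1], [])   # receivers are leaves
    return children


paths = [
    ["S", "r1", "A"],
    ["S", "r1", "r2", "B"],
    ["S", "r1", "r2", "C"],
]
tree = build_tree(paths)
branch_points = [k for k, kids in tree.items() if len(kids) > 1]
```

Here r1 and r2 are identified as branch points because each appears on more than one path and forwards to more than one child.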

We measured end-to-end losses using the mbat
tool. mbat runs at a receiver, subscribes to a
specified multicast group, and collects a trace of the
incoming packet stream, including the sequence
number and arrival time of each packet. We ran mbat
at  each  receiver  for  the  duration  of  each  experiment. 
At  the  conclusion  of  an  experiment,  we  transferred 
the  mbat  traces  and  mtrace  output  from  all  the  re¬ 
ceivers  to  a  single  location. 

Figure 3: Loss rates on link to GA when running mtrace from USC. (a) Inferred vs. directly measured loss rates. (b) Difference between the two. These measurements span different
two-minute intervals than those from AT&T (see Fig. 2) because of clock asynchrony. Nevertheless, inferred
and directly measured loss rates agreed closely. Differences were usually below 0.5%, never above 3%,
while loss rates varied between 2 and 35%.

There we ran the loss inference algorithm on the
same two-minute intervals on which we collected
mtrace measurements. For each receiver, we used
the timestamps for the beginning and end of mtrace
measurements to segment the mbat traces into
corresponding two-minute subtraces. Then we ran
infer on each two-minute interval and compared
the inferred loss rates with the directly measured loss
rates. We discuss the results in the next section.
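One way to carry out this segmentation is sketched below in Python. The trace layout, a list of (sequence number, arrival time) pairs, and the rule for deciding which lost probes belong to an interval are assumptions made for illustration; the source does not specify how infer's inputs were prepared.

```python
def segment(trace, t_start, t_end):
    """Select the packets of one receiver that arrived in [t_start, t_end).

    Returns (first_seq, last_seq, received) or None if nothing arrived.
    """
    seqs = [seq for seq, arrival in trace if t_start <= arrival < t_end]
    if not seqs:
        return None
    return min(seqs), max(seqs), set(seqs)


def indicators(received, first_seq, last_seq):
    """Per-probe 0/1 record: a gap in the sequence numbers counts as a loss."""
    return [1 if s in received else 0 for s in range(first_seq, last_seq + 1)]


trace = [(10, 0.95), (11, 1.05), (13, 1.25), (14, 1.35), (15, 2.05)]
first, last, received = segment(trace, 1.0, 2.0)
record = indicators(received, first, last)
```

In the example, probe 12 never arrives, so the record for the interval is received, lost, received, received. Comparing several receivers would additionally require a common sequence-number range for the interval, which this sketch leaves out.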

3  Experimental  Results 

We  performed  a  number  of  MBone  experiments  us¬ 
ing  different  multicast  sources  and  receivers,  and 
thus  different  multicast  trees.  Inferred  loss  rates 
agreed  closely  with  directly  measured  loss  rates 
throughout  our  experiments.  Here  we  discuss  re¬ 
sults  from  a  representative  experiment  on  August  26, 
1998.  Tables  1  and  2  list  the  end  hosts  and  branch 
routers  involved  in  this  experiment,  while  Figure  1 
shows  the  resulting  multicast  tree. 

Figure  2  shows  that  inferred  and  directly  mea¬ 
sured  loss  rates  agreed  closely  despite  a  link  expe¬ 
riencing  a  wide  range  of  loss  rates.  In  this  case,  loss 
rates  as  measured  by  mtrace  varied  between  4  and 
30%.  Nevertheless,  differences  between  inferred  and 
directly  measured  loss  rates  remained  below  1.5%. 

Figures 2-4 all show that inferred and directly
measured  loss  rates  agreed  closely  despite  imperfect 
synchronization  between  infer  and  mtrace  inter¬ 
vals.  The  two  sets  of  intervals  do  not  always  match 
because  of  variable  network  delays.  The  timestamps 
for the beginning and end of mtrace intervals are
recorded before a trace query is issued and after a
trace  query  returns,  both  according  to  the  clock  at  the 
relevant  receiver.  However,  the  corresponding  packet 
counts  are  recorded  at  the  time  the  trace  query  arrives 
at  each  router.  Therefore,  although  the  infer  in¬ 
tervals  are  derived  from  the  mtrace  intervals  using 
the  same  receiver  clock,  the  inference  is  not  always 
applied  to  exactly  the  same  1,200  probe  packets  as 
the  direct  loss  measurement.  Nevertheless,  differ¬ 
ences  between  inferred  and  directly  measured  loss 
rates  across  Figures  2-4  were  usually  well  below 
1%,  never  above  3%. 

Along the same lines, Figures 2 and 3 together
show that inferred and directly measured loss rates
agreed closely for different two-minute intervals on
the same link. We have multiple sets of mtrace
measurements for links shared by multiple receivers,
one set for each receiver. In these cases, we can run
infer on different sets of intervals corresponding
to the different sets of mtrace intervals. mtrace
intervals are different for each receiver because of
clock asynchrony between receivers and because of
the variable network delays discussed above.
Nevertheless, differences between inferred and directly
measured loss rates across Figures 2 and 3 remained
below 3%.

Figure 4: Loss rates on link to CA when running mtrace from USC. (a) Inferred vs. directly measured loss rates. (b) Difference between the two, abs(inferred - mtrace). CA experienced an order of magnitude
lower loss rates than GA (see Figs. 2 and 3). Nevertheless, inferred and directly measured loss rates agreed
closely. Differences were usually below 0.5%, never above 2%, while loss rates varied between 0 and 4%.


Figure  4  shows  that  inferred  and  directly  measured 
loss  rates  agreed  closely  even  for  links  with  very  low 
loss  rates.  In  this  case,  loss  rates  varied  between  0 
and  4%,  an  order  of  magnitude  lower  than  the  loss 
rates  in  Figure  2.  Nevertheless,  differences  between 
inferred  and  directly  measured  loss  rates  were  usu¬ 
ally  below  0.5%,  never  above  2%. 

Finally,  Figure  5  shows  that  the  inference  algo¬ 
rithm  converged  quickly  to  the  desired  loss  rates. 
Each inferred loss rate reported in Figures 2-4 is
the  value  calculated  by  infer  at  the  end  of  the  cor¬ 
responding  2-minute,  1200-probe  measurement  in¬ 
terval.  However,  infer  outputs  a  loss  rate  value 
for  every  probe.  Figure  5  reports  these  intermediate 
values.  As  shown,  inferred  loss  rates  stabilized  well 
before a measurement interval ended. Our algorithm
converged  after  fewer  than  800  probes  for  all  links 
and  all  measurement  intervals  in  our  experiments. 
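The idea of emitting an estimate after every probe can be sketched in Python for the simplest case, the link shared by two receivers. This is an illustration only, not the infer program: it uses the two-child closed form, in which the denominator of (2) equals the fraction of probes seen by both receivers, and the input records are invented.

```python
def running_shared_link_loss(r1, r2):
    """Yield the inferred loss rate on the shared link after each probe.

    r1 and r2 are 0/1 records of the same probes at two receivers.
    """
    n1 = n2 = both = 0
    for i, (a, b) in enumerate(zip(r1, r2), start=1):
        n1 += a
        n2 += b
        both += a & b
        if both == 0:
            yield None     # estimate undefined until both have seen a probe
        else:
            yield 1 - (n1 / i) * (n2 / i) / (both / i)


r1 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
r2 = [1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
estimates = list(running_shared_link_loss(r1, r2))
```

The list of intermediate values is what a convergence plot would show; with only ten probes the final value, 0.16, is of course far from stable.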

4  Ongoing  Work 

The  results  reported  in  the  previous  section  suggest 
that  it  is  possible  to  characterize  link-level  loss  based 
on  end-to-end  multicast  measurements.  However,  a 
number  of  issues  need  to  be  resolved  before  the  tech¬ 
nology  can  be  deployed  and  made  available  for  gen¬ 
eral  use. 


First,  there  is  the  question  of  how  end-to-end  mul¬ 
ticast  measurements  are  to  be  generated  and  col¬ 
lected.  We  are  pursuing  two  approaches  for  address¬ 
ing  this  question.  The  first  is  to  add  a  multicast  probe 
capability  to  an  existing  measurement  infrastructure. 
We  are  working  with  the  National  Internet  Measure¬ 
ment  Infrastructure  (NIMI)  [14]  project  to  do  exactly 
this.  Currently  NIMI  permits  users  to  schedule  a  va¬ 
riety  of  unicast  end-to-end  measurements  between 
NIMI  platforms  and  to  download  the  traces  to  a  site 
of  their  choosing.  We  are  augmenting  this  capabil¬ 
ity  to  permit  the  scheduled  execution  of  multicast 
end-to-end  measurements  followed  by  a  distribution 
of  the  traces.  Once  we  have  accomplished  this,  we 
will  also  be  able  to  design  and  execute  a  more  ex¬ 
tensive  set  of  experiments  to  validate  the  inference 
techniques  against  mtrace. 

There  is  one  disadvantage  with  the  above  ap¬ 
proach,  namely  that  the  set  of  links  that  can  be  cov¬ 
ered  is  limited  by  the  number  and  placement  of  NIMI 
nodes.  We  are  investigating  a  second  approach  that 
has  the  potential  of  addressing  that  problem.  The  ba¬ 
sic  idea  is  to  gather  end-to-end  loss  information  for 
multicast  applications,  such  as  teleconferencing  and 
continuous  media  streaming,  that  already  exist  in  the 
network. This is possible when the application uses
the Real-Time Transport Protocol (RTP) and its
associated control protocol, RTCP [20]. Currently,
applications using these protocols require receivers to
multicast loss and delay information to each other.
The loss information is limited, however, consisting
of short-term and long-term loss rates, and is not
adequate for our inference techniques.

Figure 5: Inferred loss rates on the three links between UKy and USC (i.e., the links to GA, CA, and USC)
during individual 2-minute, 1200-probe measurement intervals. Panel titles: "Loss rates on links from UKy to USC: #8" and "Loss rates on links from UKy to USC: #24". The inference algorithm converged well
before the measurement interval ended for all links during all measurement intervals.

We  plan  to  evaluate  different  ways  of  augmenting 
these  RTCP  loss  reports  to  provide  sufficient  infor¬ 
mation  to  infer  link-level  loss  behavior.  This  would 
allow  a  measurement  node  anywhere  in  the  network 
to  monitor  and  record  the  RTCP  loss  reports  for  var¬ 
ious  applications.  This  measurement  node  could 
identify  the  topology  of  an  application  using  the 
third-party  measurement  feature  of  mtrace  [13], 
then  apply  the  inference  methodology  to  obtain  the 
link-level  behavior.  The  hope  is  that  the  number  of 
participants  in  these  applications  will  be  sufficiently 
large  to  allow  a  measurement  node  to  estimate  the 
loss  behavior  of  a  large  portion  of  the  links  in  the 
network. 

A  second  important  deployment  issue  concerns 
the  need  to  know  the  topology  of  the  multicast  tree 
in  order  to  apply  our  techniques.  Our  current  ex¬ 
periments  use  the  multicast  tree  topology  discovered 
from  executing  mtrace.  Recent  work  has  shown 
that  algorithms  based  on  link-level  loss  estimators 
for  binary  trees  can  be  used  to  infer  the  topology 
of multicast trees. Topology inference of binary and
general trees was proposed in [19] and [4],
respectively. Although not reported here, the latter
algorithms are able to infer the trees described in
Section 3 with reasonable accuracy. Any general-purpose
inference infrastructure, whether built on top of
NIMI  or  RTCP,  should  have  the  flexibility  to  use  both 
mtrace  and  the  topology  inference  algorithms  re¬ 
ported  in  [19,  4]. 

In  addition  to  addressing  the  above  deployment  is¬ 
sues,  we  are  also  exploring  new  application  areas  for 
MINC.  One,  we  are  investigating  extensions  to  our 
inference  methodology  to  estimate  link-level  delay 
behavior.  We  have  developed  prototype  estimators 
for  the  delay  distribution  and  delay  variance  on  in¬ 
ternal  links  of  a  multicast  tree  based  on  end-to-end 
delay  measurements.  We  will  describe  these  results 
in  future  papers.  Two,  we  believe  our  inference  tech¬ 
niques  will  prove  useful  in  reliable  multicast  applica¬ 
tions.  These  applications  need  to  aggregate  receivers 
to  achieve  scalable  loss  recovery.  MINC  could  be 
used  to  group  receivers  that  are  topologically  close 
and  share  loss  performance. 

5  Related  Work 

A growing number of measurement infrastructure
projects (e.g., AMP [1], Felix [6], IPMA [7],
NIMI [14], Surveyor [22], and Test Traffic [23]) aim
to collect and analyze end-to-end performance data
for a mesh of unicast paths between a set of
participating hosts. We believe our multicast-based
inference techniques would be a valuable addition to these
measurement platforms. As mentioned in the
previous section, we are working to incorporate MINC
capabilities into NIMI.

A  lot  of  recent  experimental  work  has  sought  to 
understand  internal  network  behavior  from  end-to- 
end  performance  measurements  (e.g.,  see  [5,  12,  17, 
18]).  In  particular,  pathchar  [16]  is  under  evalua¬ 
tion  as  a  tool  for  inferring  link-level  statistics  from 
end-to-end  unicast  measurements.  Much  work  re¬ 
mains  to  be  done  in  this  area  and  with  MINC  we  are 
contributing  a  novel  multicast-based  methodology. 

Regarding  multicast-based  measurements,  we 
have  already  described  the  mtrace  tool  [13].  In 
addition,  the  tracer  tool  [8]  performs  topology 
discovery  through  the  use  of  mtrace.  However, 
mtrace  suffers  from  performance  and  applicability 
problems  in  the  context  of  large-scale  Internet  mea¬ 
surements.  First,  as  mentioned  earlier  in  this  paper, 
mtrace  needs  to  run  once  for  each  receiver  in  or¬ 
der  to  cover  a  complete  multicast  tree.  This  behavior 
does  not  scale  well  to  large  numbers  of  receivers.  In 
contrast,  MINC  covers  the  complete  tree  in  a  single 
pass.  Second,  mtrace  relies  on  multicast  routers  to 
respond  to  explicit  measurement  queries.  Although 
current  routers  support  these  queries,  Internet  Ser¬ 
vice  Providers  (ISPs)  may  choose  to  disable  this  fea¬ 
ture  since  it  gives  anyone  access  to  detailed  delay  and 
loss  information  about  paths  inside  their  networks. 
In  contrast,  MINC  does  not  rely  on  cooperation  from 
any  network-internal  elements. 

6  Conclusions 

We  have  presented  experimental  results  that  validate 
the  MINC  approach  to  inferring  link-level  loss  rates 
from  end-to-end  multicast  measurements.  We  com¬ 
pared  loss  rates  in  MBone  tunnels  as  inferred  using 
our  technique  and  as  measured  by  mtrace.  Inferred 
values  closely  matched  directly  measured  values  - 
differences  were  usually  well  below  1%,  never  above 
3%,  while  loss  rates  varied  between  0  and  35%.  In 
addition, our inference algorithm quickly converged
to the true loss rates - inferred values stabilized well
within  2-minute,  1200-probe  measurement  intervals. 

We  feel  that  MINC  is  an  important  new  methodol¬ 
ogy  for  network  measurement,  particularly  Internet 
measurement.  It  does  not  rely  on  network  coopera¬ 
tion  and  it  scales  to  very  large  networks.  MINC  is 
firmly  grounded  in  statistical  analysis  that  is  backed 
up  by  packet-level  simulations  and  now  experiments 
under  real  network  conditions.  We  are  continuing  to 
extend  MINC  along  both  analytical  and  experimental 
fronts.


Multicast-Based Inference of Network-Internal
Characteristics: Accuracy of Packet Loss Estimation

R. Caceres  N.G. Duffield  J. Horowitz  D. Towsley  T. Bu


Abstract — We  explore  the  use  of  end-to-end  multicast  traffic  as  measure¬ 
ment  probes  to  infer  network-internal  characteristics.  We  have  developed 
in  an  earlier  paper  [2]  a  Maximum  Likelihood  Estimator  for  packet  loss 
rates  on  individual  links  based  on  losses  observed  by  multicast  receivers. 
This  technique  exploits  the  inherent  correlation  between  such  observations 
to  infer  the  performance  of  paths  between  branch  points  in  the  multicast 
tree  spanning  the  probe  source  and  its  receivers.  We  evaluate  through  anal¬ 
ysis  and  simulation  the  accuracy  of  our  estimator  under  a  variety  of  net¬ 
work  conditions.  In  particular,  we  report  on  the  error  between  inferred 
loss  rates  and  actual  loss  rates  as  we  vary  the  network  topology,  propaga¬ 
tion  delay,  packet  drop  policy,  background  traffic  mix,  and  probe  traffic 
type.  In  all  but  one  case,  estimated  losses  and  probe  losses  agree  to  within 
2  percent  on  average.  We  feel  this  accuracy  is  enough  to  reliably  identify 
congested  links  in  a  wide-area  internetwork. 

Keywords — Internet performance, end-to-end measurements, Maximum
Likelihood Estimator, tomography

I.  Introduction 
A.  Background  and  Motivation 

Fundamental ingredients in the successful design, control and
management of networks are mechanisms for accurately
measuring their performance. Two approaches to evaluating
network performance have been (i) collecting statistics at
internal nodes and using network management packages to
generate link-level performance reports; and (ii) characterizing
network performance based on end-to-end behavior of
point-to-point traffic such as that generated by TCP or UDP. A significant
drawback of the first approach is that gaining access to a wide
range of internal nodes in an administratively diverse network
can be difficult. Introducing new measurement mechanisms into
the nodes themselves is likewise difficult because it requires
persuading large companies to alter their products. Also, the
composition of many such small measurements to form a picture of
end-to-end performance is not completely understood.

Regarding  the  second  approach,  there  has  been  much  recent 
experimental  work  to  understand  the  phenomenology  of  end- 
to-end  performance  (e.g.,  see  [1],  [3],  [15],  [20],  [22],  [23]). 
A number of ongoing measurement infrastructure projects
(Felix [6], IPMA [8], NIMI [14] and Surveyor [31]) aim to collect
and analyze end-to-end measurements across a mesh of paths
between a number of hosts. pathchar [11] is under
evaluation as a tool for inferring link-level statistics from end-to-end
point-to-point measurements. However, much work remains to
be done in this area.

In  a  recent  paper  [2],  we  considered  the  problem  of  character¬ 
izing  link-level  loss  behavior  through  end-to-end  measurements. 
We  presented  a  new  approach  based  on  the  measurement  and 
analysis  of  the  end-to-end  loss  behavior  of  multicast  probe  traf¬ 
fic.  The  key  to  this  approach  is  that  multicast  traffic  introduces 
correlation  in  the  end-to-end  losses  measured  by  receivers.  This 
correlation  can,  in  turn,  be  used  to  infer  the  loss  behavior  of  the 
links  within  the  multicast  routing  tree  spanning  the  sender  and 
receivers.  Our  principal  analytical  tool  is  a  Maximum  Likeli¬ 
hood  Estimator  (MLE)  of  the  link  loss  rates.  This  estimate  is 
derived  under  the  assumption  that  link  losses  are  described  by 
independent  Bernoulli  losses.  The  data  for  this  inference  is  a 
record  of  which  of  n  probes  were  observed  at  each  of  the  re¬ 
ceivers.  We  have  shown  that  these  estimates  are  strongly  con¬ 
sistent  (converge  almost  surely  to  the  true  loss  rates).  Moreover, 
the  asymptotic  normality  property  of  MLEs  allows  us  to  de¬ 
rive  an  expression  for  their  rate  of  convergence  to  the  true  rates 
as  n  increases.  The  presence  of  spatial  and  temporal  correla¬ 
tion  between  losses  would  violate  the  assumptions  of  the  model. 
However,  we  showed  in  [2]  that  spatial  correlations  deform  the 
Bernoulli  based  estimator  continuously  (i.e.  small  correlations 
give  rise  to  only  small  inaccuracies).  Moreover,  the  deformation 
is  a  second  order  effect  in  that  it  depends  only  on  the  change  in 
loss  correlations  between  different  parts  of  the  network.  Tem¬ 
poral  correlations  do  not  alter  the  strong  consistency  of  the  esti¬ 
mator;  they  only  slow  the  rate  of  convergence. 

We  envisage  deploying  inference  engines  as  part  of  a  mea¬ 
surement  infrastructure  comprised  of  hosts  exchanging  probes 
in  a  wide-area  network  (WAN).  Each  host  will  act  as  the  source 
of  probes  down  a  multicast  tree  to  the  others.  A  strong  advan¬ 
tage  of  using  multicast  rather  than  unicast  traffic  is  efficiency. 
$N$ multicast servers produce a network load that grows at worst
linearly as a function of $N$. On the other hand, the exchange
of unicast probes can lead to local loads which grow as $N^2$,
depending on the topology.

B.  Contribution 

Whereas the experimental component of our previous work
focused on comparing inferred and actual probe losses, the
focus of this paper is on asking how close the inferred losses are
to those of background traffic. We do this under a variety of
network configurations. These are specified by varying the
following: (i) network topology, (ii) background traffic mix, (iii) packet
drop policy, (iv) probe traffic type, and (v) network propagation
delay. In analyzing potential differences between inferred and
actual losses we identify three potential causes.

Fig. 1. A two-leaf logical multicast tree

The first is the statistical variability expected on the basis
of the loss model. The general theory of MLEs furnishes the
asymptotic variance of the estimators as the number of probes
grows. This tells us how many probes must be used in order
to achieve measurements of a desired level of accuracy. It can
be shown that the asymptotic variance of each estimated loss
probability is, to first order, equal to the true loss probability
and otherwise independent of the topology. The role of such
theoretical values is to establish a baseline for the variance of
loss estimates of background traffic.
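
As a rough illustration of how this baseline translates into a probe budget, the sketch below assumes that the variance of an estimated loss probability is approximately p/n (the first-order statement above) and applies a normal approximation. The function name `probes_needed`, the relative-error target and the 95% z-value are illustrative assumptions, not part of the original analysis.

```python
from math import ceil

def probes_needed(loss_prob, rel_err, z=1.96):
    """Approximate number of probes n such that the z-sigma confidence
    half-width is at most rel_err * loss_prob.

    Assumption: Var(estimate) ~ loss_prob / n (first-order approximation),
    so the half-width is z * sqrt(loss_prob / n).
    """
    if not (0 < loss_prob < 1) or rel_err <= 0:
        raise ValueError("need 0 < loss_prob < 1 and rel_err > 0")
    # z * sqrt(p / n) <= rel_err * p   <=>   n >= z^2 / (rel_err^2 * p)
    return ceil(z * z / (rel_err * rel_err * loss_prob))

if __name__ == "__main__":
    # A link with 1% loss, measured to within 10% of its value.
    print(probes_needed(0.01, 0.10))
```

Under these assumptions the required number of probes grows as the loss probability falls, which is why low-loss links need longer measurement windows.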

The  second  potential  cause  of  differences  is  the  non¬ 
conformance  of  probe  losses  to  the  Bernoulli  model.  In  practice 
we  find  quite  close  agreement  between  inferred  and  actual  probe 
losses.  An  examination  of  the  underlying  loss  process  shows 
that  deviations  from  the  Bernoulli  model  are  quite  small.  The 
correlation between packet losses on different links is usually
less than 0.1.

The  main  contribution  to  the  difference  comes  from  differ¬ 
ences  in  the  loss  patterns  exhibited  by  probe  and  background 
traffic.  We  have  mainly  used  TCP  background  traffic  in  the 
simulations,  reflecting  the  dominant  use  of  TCP  as  a  transport 
protocol  on  the  Internet  [32].  However,  TCP  flows  are  known 
to  exhibit  correlations.  A  well-known  example  of  this  is  syn¬ 
chronization  between  TCP  flows  which  can  occur  as  a  result  of 
slow  start  after  packet  loss  [10].  This  mechanism  can  be  ex¬ 
pected  to  give  rise  to  spatial  and  temporal  correlations  between 
losses.  However,  we  believe  that  large  and  long-lasting  spatial 
dependence  is  unlikely  in  a  real  network  because  of  traffic  het¬ 
erogeneity.  In  our  experiments  we  investigated  the  effects  of  two 
different  discard  methods:  Drop  from  Tail  and  Random  Early 
Detection  (RED)  [7].  One  of  the  motivations  for  the  introduc¬ 
tion  of  RED  has  been  to  break  dependence  introduced  through 
TCP. 

The  choice  of  probe  process  is  one  means  by  which  we  can 
aim  to  improve  the  accuracy  of  inference.  A  constraint  on  the 
interprobe  time  is  that  probe  traffic  should  not  itself  contribute 
noticeably  to  congestion.  Beyond  the  question  of  the  mean, 
the  choice  of  interarrival  time  distribution  can  affect  the  bias 
and  variance  of  the  MLE.  Probes  with  exponentially  distributed 
spacings  will  see  time  averages;  this  is  the  PASTA  property 
(Poisson Arrivals See Time Averages; see e.g. [33]). This
approach has been proposed for network measurements [24] and
is under consideration in the IP Performance Metrics working
group of the IETF [9]. We compare the effect of using constant
rate probes and Poisson probes. In most cases the difference in
accuracy is quite small. We find a far greater degradation in
accuracy when network round trip times were reduced below the
interprobe time.
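
The two probe processes compared here can be sketched as follows. This is a minimal illustration, assuming probes are described only by their send times; the function names and the use of Python's `random.Random` are choices made for the example, not details of the simulations.

```python
import random

def cbr_times(mean_gap, duration):
    """Constant-rate probe send times: one probe every mean_gap seconds."""
    n = round(duration / mean_gap)
    return [i * mean_gap for i in range(n)]

def poisson_times(mean_gap, duration, rng):
    """Poisson probe send times: i.i.d. exponential gaps with the same mean."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(1.0 / mean_gap)  # rate = 1 / mean gap
        if t >= duration:
            return times
        times.append(t)

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    print(len(cbr_times(0.016, 240.0)), len(poisson_times(0.016, 240.0, rng)))
```

Both generators have the same mean interprobe time, so they impose the same average load; only the spacing distribution differs.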

The  remaining  sections  of  the  paper  are  organized  as  follows. 
After  a  review  of  related  work,  in  Section  II  we  describe  the  loss 
model,  in  Section  III  the  MLE  and  its  properties.  In  Section  IV 
we  describe  the  algorithm  used  to  compute  the  MLE  from  data. 
We  discuss  our  framework  for  quantifying  the  errors  in  infer¬ 
ence  in  Section  V.  The  simulations  themselves  are  reported  in 
Section  VI. 


C.  Related  Work 

In the opening paragraphs we listed a number of measurement
infrastructure projects in progress ([6], [8], [14],
[31]).  We  believe  our  multicast-based  techniques  would  be  a 
valuable  addition  to  these  measurement  platforms. 

Simultaneously with the present work, Ratnasamy and McCanne
[26] have proposed using a multicast-based loss estimator
to infer topology. The emphasis in their study is on grouping
multicast  receivers,  rather  than  estimating  the  loss  probabilities 
themselves.  They  use  the  same  estimate  as  we  do  for  loss  on  the 
shared  path  to  two  receivers,  and  this  gives  rise  to  an  algorithm 
for  inferring  binary  trees.  Ad  hoc  extensions  to  trees  with  higher 
branching  ratios  are  proposed. 

There is a multicast-based measurement tool, mtrace [17],
already in use in the Internet. mtrace reports the route from
a  multicast  source  to  a  receiver,  along  with  other  information 
about  that  path  such  as  per-hop  loss  and  delay  statistics.  Topol¬ 
ogy  discovery  through  mtrace  is  performed  as  part  of  the 
tracer tool [13]. However, mtrace suffers from performance
and applicability problems in the context of large-scale
measurements. First, mtrace traces the path from the source
to  a  single  receiver  by  working  back  through  the  multicast  tree 
starting  at  that  receiver.  In  order  to  cover  the  complete  multi¬ 
cast  tree,  mtrace  needs  to  run  once  for  each  receiver,  which 
does  not  scale  well  to  large  numbers  of  receivers.  In  contrast, 
the  inference  techniques  described  in  this  paper  cover  the  com¬ 
plete  tree  in  a  single  pass.  Second,  mtrace  relies  on  multi¬ 
cast  routers  to  respond  to  explicit  measurement  queries.  Cur¬ 
rent  routers  support  these  queries.  However,  Internet  service 
providers  may  choose  to  disable  this  feature  since  it  gives  any¬ 
one  access  to  detailed  delay  and  loss  information  about  paths  in 
their  part  of  the  network.  (We  have  received  reports  that  this  is 
already  occurring).  In  contrast,  our  inference  techniques  do  not 
rely  on  cooperation  from  any  network-internal  elements. 

There  has  been  some  ad  hoc,  statistically  non-rigorous  work 
on  deriving  link-level  loss  behavior  from  end-to-end  multicast 
measurements.  An  estimator  proposed  in  [34]  attributes  the  ab¬ 
sence  of  a  packet  at  a  set  of  receivers  to  loss  on  the  common  path 
from  the  source.  However,  this  is  biased,  even  as  the  number  of 
probes  n  goes  to  infinity. 

II. Description of the Loss Model

Let $T = (V, L)$ denote the logical (as opposed to physical)
multicast tree, consisting of the set of nodes $V$, including the
source and receivers, and the set of links $L$, which are ordered
pairs $(j, k)$ of nodes, indicating a (directed) link from $j$ to $k$.
The set of children of node $j$ is denoted by $d(j)$; these are the
nodes with a link coming from $j$. For each node $j$, other than
the root 0, there is a unique node $f(j)$, the parent of $j$, such that
$j \in d(f(j))$. Each link can therefore be identified by its “child”
endpoint. We define “ancestors” (grandparents and the like) in
an obvious way, and likewise “descendants”. The difference
between a logical and a physical tree is that, whereas it is possible
for a node to have only one child in the physical tree, in the
logical tree each node except the root and leaves must have at least
two children. A physical tree can be converted into a logical tree
by deleting all nodes, other than the root, which have one child
and adjusting the links accordingly.

The root $0 \in V$ represents the source of the probes and the
set of leaf nodes $R \subset V$ (i.e., those with no children) represents
the receivers.

A probe packet is sent down the tree starting at the root. If it
reaches a node $j$ a copy of the packet is produced and sent down
the link toward each child of $j$. As a packet traverses a link $k$
(recall that $k$ denotes the endpoint), it is lost with probability
$\bar\alpha_k = 1 - \alpha_k$ and arrives at $k$ with probability $\alpha_k$. We shall
use the notation $\bar a = 1 - a$ for any quantity $a$ (with or without
subscripts) between 0 and 1. The losses on different links are
assumed to be independent and to occur with the probabilities $\bar\alpha_k$
as described. In [2] we have discussed the potential limitations
of this model, and how the model can be corrected if there are
dependencies between the losses. The two-leaf logical multicast
tree is shown in Figure 1.

We describe the passage of probes down the tree by a stochastic
process $X = (X_k)_{k \in V}$ where each $X_k$ equals 0 or 1: $X_k = 1$
signifies that a probe packet reaches node $k$, and 0 that it does
not. The packets are generated at the source, so $X_0 = 1$. For
all other $k \in V$, the value of $X_k$ is determined as follows. If
$X_k = 0$ then $X_j = 0$ for the children $j$ of $k$ (and hence for all
descendants of $k$). If $X_k = 1$, then for $j$ a child of $k$, $X_j = 1$
with probability $\alpha_j$, and $X_j = 0$ with probability $\bar\alpha_j$,
independently for all the children of $k$. We write $\alpha_0 = 1$ to simplify
expressions concerning the $\alpha_k$.
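
The loss model above can be sketched as a short simulation. This is a minimal illustration, assuming the tree is given as a dictionary of child lists and that node 1 is the single child of the source; the names `simulate_probe` and `receiver_outcome` are hypothetical.

```python
import random

def simulate_probe(children, alpha, root_child, rng):
    """Simulate one probe; return {node: 1 if the probe reached it, else 0}.

    children[k] lists the children of node k (leaves map to []);
    alpha[k] is the probability that link k passes a packet.
    The source (node 0) always holds the probe, i.e. X_0 = 1.
    """
    x = {}

    def descend(k, parent_reached):
        # A probe reaches k only if it reached the parent AND link k passes it.
        x[k] = 1 if parent_reached and rng.random() < alpha[k] else 0
        for j in children[k]:
            descend(j, x[k])

    descend(root_child, 1)
    return x

def receiver_outcome(x, receivers):
    """Only the values at the leaves are observable."""
    return tuple(x[r] for r in receivers)

if __name__ == "__main__":
    children = {1: [2, 3], 2: [], 3: []}   # two-leaf tree
    alpha = {1: 0.9, 2: 0.8, 3: 0.7}       # illustrative values
    rng = random.Random(7)
    print(receiver_outcome(simulate_probe(children, alpha, 1, rng), [2, 3]))
```

Losses on different links are drawn independently, and a loss on a link is inherited by every node below it, as the model requires.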

III.  Maximum  Likelihood  Estimation  of  Loss 

If a probe is sent down the tree from the source, the outcome
is a record of whether or not a copy of the probe was received at
each receiver. Expressed in terms of the process $X$, the outcome
is a configuration $X_{(R)} = (X_k)_{k \in R}$ of zeroes and ones at the
receivers (1 = received, 0 = lost). Notice that only the values
of $X$ at the receivers are observable; the values at the internal
nodes are invisible. The state space of the observations $X_{(R)}$ is
thus the set of all such configurations, $\Omega = \{0,1\}^R$. For a given
set of link probabilities $\alpha = (\alpha_k)_{k \in V}$, the distribution of $X_{(R)}$
on $\Omega$ will be denoted by $P_\alpha$. The probability mass function for
a single outcome $x \in \Omega$ is $p(x; \alpha) = P_\alpha(X_{(R)} = x)$.

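Let us dispatch $n$ probes, and, for each $x \in \Omega$, let $n(x)$ denote
the number of probes for which the outcome $x$ is obtained. The
probability of $n$ independent observations $x^1, \ldots, x^n$ (with each
$x^m = (x^m_k)_{k \in R}$) is then

$$p(x^1, \ldots, x^n; \alpha) = \prod_{m=1}^{n} p(x^m; \alpha) = \prod_{x \in \Omega} p(x; \alpha)^{n(x)} \qquad (1)$$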

We estimate $\alpha$ using maximum likelihood, based on the data
$(n(x))_{x \in \Omega}$, and we find that the usual regularity conditions that
imply  good  large-sample  behavior  of  the  MLE  are  satisfied  in 
the  present  situation.  This  is  useful  for  the  applications  we  have 
in  mind  because  (a)  we  want  to  assess  the  accuracy  of  our  es¬ 
timates  via  confidence  intervals,  and  (b)  it  is  important  to  de¬ 
termine  the  smallest  number  n  of  probes  needed  to  achieve  the 
desired  accuracy.  We  want  to  minimize  n  because,  although 
sending  out  probes  is  inexpensive  in  itself,  networks  are  subject 
to  various  fluctuations  (e.g.,  [20])  which  can  perturb  the  model, 
and  the  measurement  process  itself  ties  up  network  resources. 

We  begin  with  a  review  of  our  main  results  on  the  exis¬ 
tence  and  uniqueness  of  the  MLE.  Another  question,  not  treated 
here,  but  which  is  important  for  applications,  is  the  feasibility 
and organization of the computations. We work with the
log-likelihood function

$$\mathcal{L}(\alpha) = \log p(x^1, \ldots, x^n; \alpha) = \sum_{x \in \Omega} n(x) \log p(x; \alpha). \qquad (2)$$
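
For the two-leaf tree of Figure 1 the probability mass function and the log-likelihood (2) can be written out directly. The sketch below assumes the labelling used in the example later in this section (node 1 is the child of the source, nodes 2 and 3 are the receivers); the function names are illustrative.

```python
from math import log

def two_leaf_pmf(a1, a2, a3):
    """p(x; alpha) for outcomes x = (x_2, x_3) at the two receivers."""
    return {
        (1, 1): a1 * a2 * a3,
        (1, 0): a1 * a2 * (1 - a3),
        (0, 1): a1 * (1 - a2) * a3,
        # Lost on the shared link, or lost independently on both leaf links.
        (0, 0): (1 - a1) + a1 * (1 - a2) * (1 - a3),
    }

def log_likelihood(counts, a1, a2, a3):
    """L(alpha) = sum over x of n(x) * log p(x; alpha).

    counts maps each outcome (x_2, x_3) to the number of probes n(x).
    """
    pmf = two_leaf_pmf(a1, a2, a3)
    return sum(n * log(pmf[x]) for x, n in counts.items() if n > 0)

if __name__ == "__main__":
    counts = {(1, 1): 504, (1, 0): 216, (0, 1): 126, (0, 0): 154}
    print(log_likelihood(counts, 0.9, 0.8, 0.7))
```

The counts in the usage example are made up so that the observed proportions coincide with the model at (0.9, 0.8, 0.7), which is therefore where the log-likelihood peaks for this data.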

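In the notation we suppress the dependence of $\mathcal{L}$ on $n$ and
$x^1, \ldots, x^n$. For each node $k$, let $\Omega(k)$ be the set of outcomes
$x \in \Omega$ such that $x_j = 1$ for at least one receiver $j \in R$ which
is a descendant of $k$, and let $\gamma_k = \gamma_k(\alpha) := P_\alpha[\Omega(k)]$. An
estimate of $\gamma_k$ is

$$\hat\gamma_k = \sum_{x \in \Omega(k)} \hat p(x), \qquad (3)$$

where $\hat p(x) := n(x)/n$ is the observed proportion of trials with
outcome $x$. We will show how to find $\alpha$ as a function of the $\gamma_k$.
The MLE $\tilde\alpha$ is precisely that $\alpha$ which maximizes $\mathcal{L}(\alpha)$:

$$\tilde\alpha = \arg\max_{\alpha \in [0,1]^{\#U}} \mathcal{L}(\alpha). \qquad (4)$$

We shall see that, at least for large $n$, $\tilde\alpha = \Gamma^{-1}(\hat\gamma)$, using the
inverse of the function $\Gamma$ that expresses the $\gamma_k$ in terms of the
$\alpha_k$. Candidates for the MLE are solutions $\hat\alpha$ of the likelihood
equation:

$$\frac{\partial \mathcal{L}}{\partial \alpha_k}(\alpha) = 0, \quad k \in U. \qquad (5)$$

Set $\mathcal{A} = \{(\alpha_k)_{k \in U} : \alpha_k > 0\}$, and $\mathcal{G} = \{(\gamma_k)_{k \in U} : \gamma_k > 0 \ \forall k;\ \gamma_k < \sum_{j \in d(k)} \gamma_j \ \forall k \in U \setminus R\}$.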

Theorem 1: When $\hat\gamma \in \mathcal{G}$, the likelihood equation has the
unique solution $\hat\alpha := \Gamma^{-1}(\hat\gamma)$ that can be expressed as follows.
Define $(\hat A_k)_{k \in V}$ for the root node by $\hat A_0 = 1$, for leaf nodes
$k \in R$ by $\hat A_k = \hat\gamma_k$, and for all other nodes $k \in U \setminus R$ as the
unique solution in (0, 1] of

$$1 - \hat\gamma_k / \hat A_k = \prod_{j \in d(k)} \left(1 - \hat\gamma_j / \hat A_k\right). \qquad (6)$$

Then for $k \in U$, $\hat\alpha_k = \hat A_k / \hat A_{f(k)}$.

The form (6) follows from the corresponding relations that
express $\gamma_k$ in terms of $A_k := \alpha_k \alpha_{f(k)} \cdots \alpha_0$.
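
One way to see where (6) comes from is sketched below. The auxiliary quantity $\beta_k$ does not appear in the original text; it is introduced here only for the derivation and denotes the probability that at least one receiver at or below node $k$ sees the probe, given that the probe reached $k$ (so $\beta_k = 1$ when $k$ is itself a receiver).

```latex
\begin{align*}
\gamma_k &= A_k \beta_k, \qquad A_k = \alpha_k \alpha_{f(k)} \cdots \alpha_0, \\
1 - \beta_k &= \prod_{j \in d(k)} \left(1 - \alpha_j \beta_j\right)
  \quad \text{(independence across the child subtrees)}, \\
\alpha_j \beta_j &= \frac{A_j \beta_j}{A_k} = \frac{\gamma_j}{A_k}
  \quad \text{since } A_j = \alpha_j A_k \text{ for } j \in d(k), \\
\Longrightarrow \quad 1 - \frac{\gamma_k}{A_k}
  &= \prod_{j \in d(k)} \left(1 - \frac{\gamma_j}{A_k}\right).
\end{align*}
```

Replacing each $\gamma$ by its estimate $\hat\gamma$ gives the form (6).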

We complete the picture by showing that the solution of the
likelihood equation actually maximizes the likelihood function
under some additional conditions. The set $\mathcal{A}$ contains all
positive $\alpha_k$, including the possibility $\alpha_k > 1$. Let us now restrict our
attention to link probabilities $\alpha \in \mathcal{B} = (0,1)^{\#U} \subset \mathcal{A}$. Being
a solution of the likelihood equation does not preclude $\hat\alpha$ from
being either a minimum or a saddlepoint for the likelihood
function, with the maximum falling on the boundary of $\mathcal{B}$. For some
simple topologies we are able to establish directly that $\mathcal{L}(\alpha)$ is
(jointly) concave in the parameters at $\alpha = \hat\alpha$, which is hence
the MLE $\tilde\alpha$. For more general topologies we use general results
on maximum likelihood to show that $\hat\alpha = \tilde\alpha$ for all sufficiently
large $n$.

Theorem 2:

(i) The model is identifiable in $\mathcal{B}$, i.e., $\alpha, \alpha' \in \mathcal{B}$ and $P_\alpha = P_{\alpha'}$
implies $\alpha = \alpha'$. Thus, distinct link probabilities $\alpha$ produce
distinct statistical behavior of the $\hat\gamma$ as $n \to \infty$.

(ii) As $n \to \infty$, $\tilde\alpha \to \alpha$ with $P_\alpha$-probability 1, i.e., the MLE
is strongly consistent.

(iii) With probability 1, for sufficiently large $n$, $\hat\alpha = \tilde\alpha$, i.e., the
solution of the likelihood equation maximizes the likelihood.

This is proven using large sample theory for MLE, such as in
[30]. Finally we have a result on asymptotic normality of the
MLE. The Fisher Information Matrix at $\alpha$ based on $X_{(R)}$ is the
matrix $\mathcal{I}_{jk}(\alpha) := \mathrm{Cov}\left(\frac{\partial \mathcal{L}}{\partial \alpha_j}(\alpha), \frac{\partial \mathcal{L}}{\partial \alpha_k}(\alpha)\right)$.

Theorem 3: $\mathcal{I}(\alpha)$ is non-singular, and as $n \to \infty$, under
$P_\alpha$, $\sqrt{n}(\tilde\alpha - \alpha)$ converges in distribution to a multivariate
normal random vector with mean vector 0 and covariance matrix
$\mathcal{I}^{-1}(\alpha)$.

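Example: MLE for the Two-Leaf Tree. Denote the 4 points
of $\Omega = \{0,1\}^2$ by $\{00, 01, 10, 11\}$. Then

$$\hat\gamma_1 = \hat p(11) + \hat p(10) + \hat p(01), \qquad (7)$$

$$\hat\gamma_2 = \hat p(11) + \hat p(10), \qquad \hat\gamma_3 = \hat p(11) + \hat p(01), \qquad (8)$$

and equations (6) for $\hat A_k$ in terms of the $\hat\gamma_k$ yield

$$\hat\alpha_1 = \frac{\hat\gamma_2 \hat\gamma_3}{\hat\gamma_2 + \hat\gamma_3 - \hat\gamma_1} \qquad (9)$$

$$\phantom{\hat\alpha_1} = \frac{(\hat p(01) + \hat p(11))(\hat p(10) + \hat p(11))}{\hat p(11)} \qquad (10)$$

$$\hat\alpha_2 = \frac{\hat\gamma_2 + \hat\gamma_3 - \hat\gamma_1}{\hat\gamma_3} = \frac{\hat p(11)}{\hat p(01) + \hat p(11)} \qquad (11)$$

$$\hat\alpha_3 = \frac{\hat\gamma_2 + \hat\gamma_3 - \hat\gamma_1}{\hat\gamma_2} = \frac{\hat p(11)}{\hat p(10) + \hat p(11)} \qquad (12)$$

Note that although it is possible that $\hat\alpha_1 > 1$ for some finite $n$,
this will not happen when $n$ is sufficiently large, due to
Theorem 2.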
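
The closed-form expressions of this example can be evaluated directly from the four outcome counts. The following is a minimal sketch; the function name `two_leaf_mle` and its argument order are illustrative, and outcomes are written as $(x_2, x_3)$ as in the example.

```python
def two_leaf_mle(n11, n10, n01, n00):
    """Closed-form estimates (alpha1, alpha2, alpha3) for the two-leaf tree.

    nXY is the number of probes with outcome (x_2, x_3) = (X, Y).
    Values above 1 are possible for small samples.
    """
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        raise ValueError("no probe reached both receivers; estimate undefined")
    p11, p10, p01 = n11 / n, n10 / n, n01 / n
    alpha1 = (p01 + p11) * (p10 + p11) / p11   # shared link
    alpha2 = p11 / (p01 + p11)                 # link to receiver 2
    alpha3 = p11 / (p10 + p11)                 # link to receiver 3
    return alpha1, alpha2, alpha3

if __name__ == "__main__":
    print(two_leaf_mle(504, 216, 126, 154))
```

With very few probes the estimate for the shared link can exceed 1, for instance when most probes reach exactly one receiver; this is the finite-sample effect noted above.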

IV.  Computation  of  the  MLE  on  a  General  Tree 

In this section we describe the algorithm for computing $\hat\alpha$ on
a general tree. An important feature of the calculation is that
it can be performed recursively on trees. First we show how to
calculate the $\hat\gamma_k$. These can be calculated by reconstruction of a
sample path of the full process $(X_k)_{k \in V}$ that is consistent with
the measured data $X^{(1)}_{(R)}, \ldots, X^{(n)}_{(R)}$ from $n$ probes. We define
the $n$-element binary vectors $(\hat X_k)_{k \in V}$ recursively by

$$\hat X_k = X_k, \quad k \in R \qquad (13)$$

$$\hat X_k(i) = \bigvee_{j \in d(k)} \hat X_j(i), \quad k \in V \setminus R \qquad (14)$$

so that

$$\hat\gamma_k = n^{-1} \sum_{i=1}^{n} \hat X_k(i). \qquad (15)$$

    procedure main ( k ) {
        find_x ( k ) ;
        infer ( k, 1 ) ;
    }

    procedure find_x ( k ) {
        foreach ( j ∈ d(k) ) {
            Xhat_j = find_x ( j ) ;
            foreach ( i ∈ {1,...,n} ) {
                Xhat_k[i] = Xhat_k[i] ∨ Xhat_j[i] ;
            }
        }
        gammahat_k = (1/n) Σ_{i=1..n} Xhat_k[i] ;
        return Xhat_k ;
    }

    procedure infer ( k, A ) {
        Ahat_k = solvefor( Ahat_k, (1 - gammahat_k/Ahat_k) == ∏_{j ∈ d(k)} (1 - gammahat_j/Ahat_k) ) ;
        alphahat_k = Ahat_k / A ;
        foreach ( j ∈ d(k) ) {
            infer ( j, Ahat_k ) ;
        }
    }

Fig. 2. Pseudocode for Inference of Link Probabilities

For simplicity we assume now that $\hat\gamma \in \Gamma((0,1)^{\#U})$. The
calculation of $\hat\alpha$ can be done by another recursion. We formulate
both recursions in pseudocode in Figure 2. The procedure
find_x calculates the $\hat X_k$ and $\hat\gamma_k$, assuming $\hat X_k$ initializes to
$X_k$ for $k \in R$ and 0 otherwise. The procedure infer calculates
the $\hat\alpha_k$. The procedures could be combined. The full set
of link probabilities is estimated by executing main(1); recall
1 is the single child of the root node 0. Here, an empty
product (which occurs when the first argument of infer is a
leaf node) is understood to be zero. The routine solvefor
finds the unique solution $\hat A_k$ in (0, 1] to (6).
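
The two recursions of Figure 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the experiments: the tree is given as a dictionary of child lists, receiver data as lists of 0/1 flags, and solvefor is realised here by plain bisection, which is a choice made for this example. All function and variable names are hypothetical.

```python
from math import prod

def infer_link_probs(children, root_child, received):
    """Estimate the pass probability of every link in a logical tree.

    children[k]  -- list of children of node k (receivers map to [])
    root_child   -- the single child of the source (node 1 in the text)
    received[r]  -- list of n 0/1 flags, one per probe, for receiver r
    Returns {k: alpha_hat_k}; links are named by their child endpoint.
    """
    n = len(next(iter(received.values())))
    xhat, gamma, alpha = {}, {}, {}

    def find_x(k):
        # Receivers start from the data; interior nodes start from all zeros.
        xhat[k] = list(received[k]) if not children[k] else [0] * n
        for j in children[k]:
            xj = find_x(j)
            xhat[k] = [a | b for a, b in zip(xhat[k], xj)]  # elementwise OR
        gamma[k] = sum(xhat[k]) / n                         # equation (15)
        return xhat[k]

    def solve_for_A(k):
        if not children[k]:
            return gamma[k]              # leaf: A_k = gamma_k
        def h(a):                        # left side minus right side of (6)
            return (1 - gamma[k] / a) - prod(1 - gamma[j] / a for j in children[k])
        lo, hi = gamma[k], 1.0           # h(lo) <= 0; grow hi until h(hi) > 0
        while h(hi) <= 0:
            hi *= 2
            if hi > 1e6:                 # gamma outside the set G: no solution
                raise ValueError("equation (6) has no solution for node %r" % k)
        for _ in range(100):             # plain bisection
            mid = (lo + hi) / 2
            if h(mid) > 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    def infer(k, A_parent):
        A_k = solve_for_A(k)
        alpha[k] = A_k / A_parent
        for j in children[k]:
            infer(j, A_k)

    find_x(root_child)
    infer(root_child, 1.0)
    return alpha
```

For the two-leaf tree this reproduces the closed-form estimates of the example in Section III. The leaf case returns $\hat\gamma_k$ directly, which plays the role of the empty-product convention described above.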

The recursive nature of the algorithm has important
consequences for its implementation in a network setting. The
calculation of $\hat\gamma_k$ and $\hat A_k$ depends on $X$ only through the $(\hat X_j)_{j \in d(k)}$.
In  a  networked  implementation  this  would  enable  the  calculation 
to  be  localized  in  subtrees  at  a  representative  node.  The  compu¬ 
tational  effort  at  each  node  would  be  at  worst  proportional  to  the 
depth  of  the  tree  (for  the  node  which  is  unlucky  enough  to  be 
the  representative  for  all  distinct  subtrees  to  which  it  belongs). 
The  network  load  induced  by  the  communication  of  data  could 
be  kept  local,  e.g.,  by  scoped  multicast  amongst  sibling  repre¬ 
sentatives. 

Fig. 3. Simulation Topology: Links are of two types: “edge” links of 1.5Mb/s capacity and 10ms latency, and interior links of 5Mb/s capacity and 50ms
latency. LEFT: “regular” topology with branching ratio 2. RIGHT: “irregular” topology.


V.  Framework  for  Simulation  Study 

We  evaluated  our  loss  inference  algorithm  using  the  ns  simu¬ 
lator  [19].  This  enabled  us  to  investigate  the  effectiveness  of  the 
estimator  over  a  range  of  network  topologies,  link  delays,  packet 
drop  policies,  background  traffic  types,  and  probe  traffic  types. 
In  particular  we  were  able  to  determine  the  actual  loss  experi¬ 
enced  by  background  traffic,  and  by  probe  traffic,  and  compare 
these  values  to  those  predicted  by  the  inference  algorithm  on  the 
basis  of  measurements  at  the  leaf  nodes.  The  experiments  show 
that  the  agreement  between  inferred  and  probe  loss  is  extremely 
good.  This  shows  that  the  model  of  probe  loss  and  the  associated 
inference  technique  are  quite  effective  in  the  small  networks 
used  in  the  simulation.  This  is  encouraging  since  we  expect 
flow  synchronization  effects  (that  would  violate  the  model)  to 
be more noticeable amongst a smaller number of flows.
Agreement between inferred loss and background traffic loss is quite
reasonable,  although  not  as  close  as  between  inferred  and  probe 
loss.  Some  difference  is  expected  due  to  the  difference  in  tem¬ 
poral  statistics  of  TCP  flows  and  probes. 

A.  Comparing  Loss  Probabilities 

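We describe our approach to comparing two sets of loss
probabilities $p$ and $q$. For example $p$ could be an inferred probability
on a link, $q$ the corresponding actual probability. For some error
margin $\varepsilon > 0$ we define the error factor

$$F_\varepsilon(p, q) = \max\left\{ \frac{p(\varepsilon)}{q(\varepsilon)}, \frac{q(\varepsilon)}{p(\varepsilon)} \right\} \qquad (16)$$

where $p(\varepsilon) = \max\{\varepsilon, p\}$ and $q(\varepsilon) = \max\{\varepsilon, q\}$. Thus, we treat
$p$ and $q$ as being not less than $\varepsilon$, and having done this, the error
factor is the maximum ratio, upwards or downwards, by which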
they  differ.  Unless  otherwise  stated,  we  used  the  default  value 
s  =  10“^  in  this  paper.  The  choice  of  this  metric  is  motivated  by 
the  expectation  that  it  is  desirable  to  estimate  the  relative  mag¬ 
nitude  of  loss  ratios  on  different  links  in  order  to  distinguish 


those which suffer higher loss. In summarizing the relative
accuracy of a set of loss measurements, we will calculate
statistics of the error factor, such as mean and quantiles of $F_\varepsilon(p_i, q_i)$
where $p = (p_i)$ and $q = (q_i)$ are two sets of loss probabilities
(inferred  and  actual,  say).  Here  the  index  i  runs  over  a  set  of 
links,  a  set  of  measurements  on  the  same  link  made  at  different 
times  or  during  different  simulations,  or  some  combination  of 
these. 
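
The error factor (16) is simple to compute. In the sketch below the error margin is passed explicitly rather than given a default, and the value used in the usage example is purely illustrative; the function name is hypothetical.

```python
def error_factor(p, q, eps):
    """F_eps(p, q) = max(p_eps / q_eps, q_eps / p_eps),
    where p_eps = max(eps, p) and q_eps = max(eps, q)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    pe, qe = max(eps, p), max(eps, q)
    return max(pe / qe, qe / pe)

if __name__ == "__main__":
    # Inferred loss 2%, actual loss 1%: the estimate is off by a factor of 2.
    print(error_factor(0.02, 0.01, eps=1e-3))
```

Because both arguments are floored at the margin, two very small loss rates compare as equal (factor 1) instead of producing a large ratio that carries no practical meaning.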

B.  Summary  Statistics  of  the  Error  Factor 

In describing the mean and variability of the error factors, we
shall use the following summary statistics. We shall estimate
the center of the distribution of a set of error factors $X_i$ by the
two-sided quartile-weighted median

$$m := (Q_{.25} + 2Q_{.5} + Q_{.75})/4 \qquad (17)$$

where $Q_p$ denotes the $p$th quantile of the $X_i$. $m$ is particularly
suited to skewed distributions; see [29] for further detail. We
characterize the high values of the error factors through the 90th
percentile. Both these summary statistics are robust, being
independent of any assumption on the distribution of the error
factors.
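
Both summary statistics can be computed with the Python standard library. The sketch below assumes the "inclusive" interpolation rule of `statistics.quantiles`; the text does not say which quantile convention was used, so results for small samples may differ slightly from the paper's.

```python
from statistics import quantiles

def quartile_weighted_median(xs):
    """m = (Q_.25 + 2 * Q_.5 + Q_.75) / 4, as in equation (17)."""
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

def percentile_90(xs):
    """90th percentile: the ninth of the nine decile cut points."""
    return quantiles(xs, n=10, method="inclusive")[8]

if __name__ == "__main__":
    factors = [1.0, 1.0, 1.1, 1.2, 3.0]   # illustrative, skewed sample
    print(quartile_weighted_median(factors), percentile_90(factors))
```

In the skewed sample above the single large value moves the mean noticeably but leaves $m$ close to the bulk of the data.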

C.  Experimental  Variables 

We  explored  the  performance  of  the  inference  algorithm  un¬ 
der  variation  of  the  following  quantities. 

C.l  Network  Topology 

We investigated three topologies. We used the two-leaf binary
tree of Figure 1 to explore the variables listed below within a
tightly controlled environment. We also explored two larger
binary topologies: the regular 8 leaf binary tree of Figure 3(left),
and the irregular tree of Figure 3(right). In both of the larger
trees we arranged for some heterogeneity between the edges
and the center in order to mimic the difference between the core
and edges of a large WAN, with the interior of the tree having
higher capacity (5Mb/sec) and latency (50ms) than at the edge
(1.5Mb/sec and 10ms).

Fig. 4. Accuracy of Inference vs. Sample Window: Mean error factor over all links and windows of regular topology in Figure 3(left) for RED or
DropTail queueing; Poisson or CBR probes. LEFT: inferred loss vs. probe loss. RIGHT: inferred loss vs. background loss. Probe bytes are 1.8% of total;
average utilization is 60%.

C.2  Packet  Discard  Method 

Each node had a buffer capacity of 20 packets, independent
of packet size. We compare the effects of two methods of
packet discard: Drop from Tail (DT), and discard based on
Random Early Detection (RED) [7]. One of the benefits expected
from the deployment of RED is increased utilization through
the breaking of synchronization that can occur due to slow start
of TCP after congestion, as identified in [10]. We used the ns
default parameters of RED in the simulations.

C.3  Background  Traffic 

Each of the trees was equipped with a variety of flows of
background traffic. Flows were of two types: infinite data sources
that use the Transmission Control Protocol (TCP), and on-off
sources using the User Datagram Protocol (UDP), the on
and off periods having either a Pareto or an exponential
distribution. In most of the simulations on the larger trees we used
predominantly TCP, with a mixture of UDP. We chose this mix
because TCP is the dominant transport protocol on the
Internet [32].

C.4  Probe  Characteristics 

It is desirable that probe traffic only use a small part of the
available link capacity. For the experiments in the large
topologies we used 40-byte probes with a mean interprobe time of
16ms, i.e. a 20 kbit/sec stream. This is just over 1% of the
capacity  of  the  smallest  link  used;  it  would  be  a  far  smaller  frac¬ 
tion  of  capacities  commonly  used  in  today's  Internet  backbones. 
We  used  two  types  of  probes:  constant  rate  probes  and  Poisson 
probes.  The  use  of  the  latter  has  been  proposed  [24]  for  end-to- 
end  measurements  on  the  basis  that  Poisson  Arrivals  See  Time 
Averages;  see  e.g.  [33]. 
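
The probe load quoted above follows from simple arithmetic, sketched here with the figures from the text (40-byte probes, 16ms mean spacing, 1.5Mb/s edge links). The helper names are illustrative.

```python
def probe_rate_bps(probe_bytes, mean_gap_s):
    """Average probe bit rate for a given probe size and mean spacing."""
    return probe_bytes * 8 / mean_gap_s

def load_fraction(rate_bps, link_capacity_bps):
    """Share of a link's capacity consumed by a stream of the given rate."""
    return rate_bps / link_capacity_bps

if __name__ == "__main__":
    rate = probe_rate_bps(40, 0.016)        # about 20 kbit/s
    print(rate, load_fraction(rate, 1.5e6)) # about 1.3% of a 1.5Mb/s link
```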


C.5 Relative Time Scales

We investigated the effects of network roundtrip time on
estimator accuracy. This is potentially important because the
roundtrip time determines the time it takes TCP to respond to
packet losses. Thus the relative size of this time and the
interprobe time determines the number of probe packets that sample
congestion due to TCP traffic. In these experiments we reverted
to a uniform link latency of between 1ms and 100ms.

VI. Simulation Results
A. Qualitative Sample Path Behavior

We start by illustrating some properties of sample paths of the
MLE. We shall make mostly qualitative observations initially;
quantitative statistical measures of the accuracy of inference will
be applied later.

In the regular topology of Figure 3(left) we conducted
experiments of 240 seconds duration. Background traffic was
generated by 30 infinite FTP sources using TCP, and another 30
on-off UDP sources, mostly with low rates and either exponential
or Pareto distributed on and off periods. There was one
experiment for each of the four combinations of DropTail or RED
packet discard and Poisson or CBR probes. The mean time between
probes was 16ms, so about 15,000 probes were used in each
experiment. For each of the experiments we calculated $\hat\alpha$ on a
moving window of a given width, using jumps of half the width.
We display the mean error factor as a function of window size in
Figure 4. On the left we show the error factor between inferred
and actual probe loss; on the right between inferred and actual
background loss. The main points to observe are that (i) error
factors decrease as window size increases; (ii) the error factor
between inferred and probe losses is small when compared with
that between inferred and background losses; (iii) the error
factors are reasonably insensitive to choice of packet discard
method and probe type. To the extent that there are differences,
mean error factors between inferred and background losses for CBR
probes are slightly smaller than for Poisson probes, at least for
larger window sizes (about 1.2 compared with about 1.5). Error
factors for RED are marginally worse than for DropTail. We shall
comment upon these differences later.


B.  Dynamic  Tracking  of  Loss 

In Figure 5 we display the time series of background, probe
and inferred loss on one link over the moving windows of a
simulation similar to that just described. However, we arrange for
some additional sources to be turned on after 60 seconds have
elapsed. We display how inferred losses track the real ones on a
5 second window (left) and a 20 second window (right). There
is considerable variability between the inferred and actual loss at
the 5 second window, not all of which is removed by increasing
to a 20 second window. However, even at the 5 second window
it appears that the estimator responds rapidly to the increase in
actual loss that occurs after 60 seconds have elapsed.

Fig. 5. Dynamic Accuracy of Inference: Loss rates of background packets, probe packets and inferred on link 8 in regular topology in Figure 3(left) for
RED queueing and Poisson probes. LEFT: 5 second window. RIGHT: 20 second window. Additional sources started at 60 seconds; note tracking by estimator
of induced congestion. Probe bytes are 2% of total on 1.5Mb/s link with 60% utilization.

From Figure 5 it is evident that the inferred loss tracks the
probe loss more closely than the loss of background packets.
Increasing the window size narrows some of the difference. We
illustrate this for a single window in Figure 6. For a 5 second
and a 240 second window, we display how the ordering of the
links according to loss probability differs according to whether
the loss used for ordering is that for background or probe or
inferred loss. To do this we have placed each set of probabilities
on an axis (background loss on left, probe loss in middle and
inferred loss on right) and joined the values for given links. The
flatter the lines, the greater the accuracy; the less they cross, the
better the ordering is preserved. In this example, both accuracy
and ordering are improved by using the larger window. It is clear
in this example that despite error factors of about 2 between
some of the inferred and background traffic losses, the inference
is sufficiently accurate to distinguish the links with the highest
loss for either probe or background packets.

Fig. 6. Accuracy and Ordering of Inference vs. Sample Window: Loss rates in regular topology of Figure 3(left) for RED queueing and Poisson
probes. LEFT: 5 second window. RIGHT: 240 second window. Lines join probabilities of a given link. Fewer crossings indicate better preservation of order
between actual and estimated probabilities. Flatter lines indicate better accuracy of estimates. Probe bytes are 3% of total on 1.5Mb/s link with 50% utilization.
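
The moving-window scheme used in these experiments (windows of a given width, advanced in jumps of half the width) can be sketched as follows. Only the windowing is shown; in the experiments the estimator is then applied to the probes falling in each window. The function names and the half-open interval convention are assumptions made for this illustration.

```python
def moving_windows(duration, width):
    """(start, end) pairs of the given width, advancing by width / 2."""
    if width > duration:
        return []
    step = width / 2
    count = int((duration - width) / step) + 1
    return [(i * step, i * step + width) for i in range(count)]

def probes_per_window(send_times, duration, width):
    """Indices of the probes whose send time falls in each window [start, end)."""
    return [[i for i, t in enumerate(send_times) if start <= t < end]
            for start, end in moving_windows(duration, width)]

if __name__ == "__main__":
    # 240 s experiment: 95 overlapping 5 s windows, 23 overlapping 20 s windows.
    print(len(moving_windows(240, 5)), len(moving_windows(240, 20)))
```

Wider windows contain more probes and therefore give smoother estimates, at the cost of responding more slowly to changes such as the extra sources switched on at 60 seconds.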

C.  Quantitative  Statistical  Measures  of  Accuracy 

We now present some broad statistical measures of the accuracy
of the inference in different network configurations in
topologies with 15 links. We conducted 10 experiments of 240
seconds duration for each of the four combinations of DropTail
or RED packet discard with CBR or Poisson probes. We then
calculated the center m and 90th percentile of the 150 error
factors (10 experiments × 15 links).
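
The summary statistics can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the tables: the error factor is taken to be the larger of the two ratios between a pair of probabilities after both are truncated from below at the error margin, a plain median stands in for the center m (any weighting is not reproduced here), the 90th percentile uses the nearest-rank rule, and the value of `eps` and the sample pairs are placeholders.

```python
import statistics

def error_factor(p, q, eps):
    """Ratio >= 1 between two probabilities truncated below at eps (assumed form)."""
    p_t = max(p, eps)
    q_t = max(q, eps)
    return max(p_t / q_t, q_t / p_t)

def summarize(pairs, eps):
    """Return (median, nearest-rank 90th percentile) of the error factors."""
    factors = sorted(error_factor(p, q, eps) for p, q in pairs)
    center = statistics.median(factors)
    rank = (9 * len(factors) + 9) // 10  # ceil(0.9 * n) in integer arithmetic
    return center, factors[rank - 1]

# Placeholder (inferred, actual) loss pairs and error margin.
pairs = [(0.010, 0.010), (0.020, 0.010), (0.0001, 0.0005),
         (0.030, 0.020), (0.004, 0.008)]
center, p90 = summarize(pairs, eps=1e-3)
```

Truncation keeps very small probabilities from dominating the ratio: the pair (0.0001, 0.0005) contributes an error factor of 1 because both values fall below the placeholder margin.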

The results are tabulated for the regular topology with mixed
TCP and UDP sources in Table I; for the regular topology with
TCP sources only in Table II; and for the irregular topology with
mixed sources in Table III. Taking these as a group, the accuracy
of inference of probe loss is striking. Looking at the first pair of
columns in each table, we see that the error is no more than 2%
of the true value on average (i.e., an error factor of 1.02), the 90th
percentile of the error being 17% of the true value at worst (i.e.,
an error factor of 1.17).

The  error  factors  between  actual  probe  loss  and  background 
traffic  loss  are  somewhat  larger;  this  difference  is  then  the  main 
contribution  to  errors  in  inferring  the  background  traffic  loss  by 
the probe loss. The center m is less than 1.5, and the 90th
percentile is less than 2.2. Pure TCP background traffic has
somewhat higher error factors than mixed TCP and UDP. The
irregular topology has somewhat higher error factors than the
regular topology. The average utilization in these simulations
was about 60%. We also conducted simulations at up to 90%
utilization on the two-leaf binary tree with approximately the
same number of probes. In most cases the summary statistics
were of the same order.

Fig. 7. Estimator Accuracy: Scatter plots of 1110 pairs of loss probabilities gathered from all simulations. LEFT: inferred loss vs. probe loss; RIGHT:
inferred loss vs. background loss. All probabilities truncated with error margin e = 10^(-…).

1.56
1.10

1.36

TABLE I

Statistics of Error Factor vs. Packet Discard and Probe
Method. TCP and UDP background traffic. Regular Topology. Weighted
Median and 90th percentile of error factor over all links during 10 simulations
of 240 seconds. Error margin was e = 10^(-…). In about 20% of cases, one or
both probabilities compared were less than e.


TABLE III

Statistics of Error Factor vs. Packet Discard and Probe
Method. TCP and UDP background traffic. Irregular Topology. Mean and
90th percentile of error factor over all links during 10 simulations of 240
seconds. Error margin was e = 10^(-…). In no more than 8% of cases, one or
both probabilities were less than e.

RED
1.53

TABLE II

Statistics of Error Factor vs. Packet Discard and Probe
Method. TCP background traffic only. Regular Topology. Weighted Median
and 90th percentile of error factor over all links during 10 simulations of 240
seconds. Error margin was e = 10^(-…). In about 15% of cases, one or both
probabilities compared were less than e.

1.53
1.09
1.28

1.88

1.19

1.49
1.49

6.83

1.30

1.61

1.71

5.07

TABLE IV

Statistics of Error Factor vs. Link Delay. TCP and UDP
background traffic. Regular Topology. DropTail with Poisson Probes.
Weighted Median and 90th percentile of error factor over all links during 10
simulations of 240s for each delay value. One or both probabilities compared
were less than error margin e = 10^(-…) in up to 40% of cases.

Comparing the different packet discard methods, we see that
RED always gives somewhat lower values for m and the 90th
percentile than the corresponding DropTail. This fits with our
expectation that the randomization induced by RED will break
correlations induced by TCP flow control, and hence cause
patterns of loss for background traffic to more closely resemble the
Bernoulli loss model.

Comparing the different packet probe types, we see that CBR
has m and 90th percentile consistently slightly lower than for
Poisson probes. The reason for this small difference is not clear
at present. Poisson probes see time averages [33] and hence
yield unbiased measurements. It is possible, though, that they
exhibit higher variances because the potentially extreme
(long or short) interarrival times lead to worse sampling of
network congestion events.

We examined the influence of network propagation delay on
error factors. For DropTail packet discard and Poisson probes,
we find (see Table IV) that error factors increase as propagation
delay decreases. A possible explanation for this is the following.
We observe an increase in utilization as the propagation
delay is decreased, the utilization being close to 100% on some
links when propagation delay is 1ms. Since recovery after TCP
losses will be correspondingly quick, any spare capacity will be
rapidly exploited, and congestion may be long lived, leading to
temporal correlations between probe losses. Whereas this would
not alter the asymptotic accuracy of the MLE, it would slow the
rate of convergence as the number of probes is increased,
leading to high estimator variance. This hypothesis is supported by
Table IV: at 1ms feedback delay, most of the error is between
the inferred and probe loss. 1ms is far shorter than the minimum
link propagation delays on the Internet, so we do not expect this
phenomenon to occur in practice. We stress, however, that it
remains to obtain a full understanding of the effect on accuracy of
the interactions between interprobe time, propagation delay and
variables such as packet discard method and probe type.

We summarize all our experiments hitherto in Figure 7, where
we show a scatter plot of pairs of (inferred loss, probe loss) on
the left, and pairs of (inferred loss, background loss) on the
right. Thus each point corresponds to a single link on a single
simulation run. Also included here are points for experiments
conducted with the combinations of traffic types, discard
method, probe distribution and topology described above, but
with a more variable flow duration. The flow durations were
obtained by choosing random beginning and end times for each
flow in a given simulation, rather than having the flows present
for the whole simulation. In these examples, inferred loss is a
better predictor of background loss when the latter is at least
1%: for this subset of data points the mean error factor is 1.20
compared with 1.28 for the complete set.

VII.  Conclusions 

In this paper we have analyzed the efficacy of multicast-based
inference in estimating loss probabilities in the interior of a
network from end-to-end measurements. The principal tool was a
Maximum Likelihood Estimator of the link loss probabilities.
Probes are multicast from a source; the data for the MLE is a
record of which probes were received at each leaf of the
multicast tree. Although the method assumes that losses are
independent, we have shown in some cases that it is relatively insensitive
to the presence of spatial loss correlations; temporal correlations
increase its variance, so that a longer measurement period is
required; see [2].
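
For the smallest case, a two-leaf tree with one shared link and one link to each leaf, the estimate reduces to ratios of observed reception frequencies when losses are independent Bernoulli events. The sketch below illustrates this special case only; the function names, the loss values and the synthetic probe generator are hypothetical and do not reproduce the ns experiments.

```python
import random

def simulate_two_leaf(loss0, loss1, loss2, n_probes, seed=0):
    """Synthetic probe record: (received at leaf 1, received at leaf 2).

    loss0 is the loss probability on the shared link; loss1 and loss2
    are the loss probabilities on the links to each leaf. Losses are
    drawn independently (Bernoulli model).
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_probes):
        passed_shared = rng.random() >= loss0
        r1 = passed_shared and rng.random() >= loss1
        r2 = passed_shared and rng.random() >= loss2
        outcomes.append((r1, r2))
    return outcomes

def infer_two_leaf(outcomes):
    """Estimate (loss0, loss1, loss2) from leaf reception records."""
    n = len(outcomes)
    g1 = sum(1 for r1, _ in outcomes if r1) / n           # leaf 1 received
    g2 = sum(1 for _, r2 in outcomes if r2) / n           # leaf 2 received
    g12 = sum(1 for r1, r2 in outcomes if r1 and r2) / n  # both received
    if g12 == 0:
        raise ValueError("no probe reached both leaves; cannot estimate")
    pass0 = g1 * g2 / g12  # shared link pass probability
    pass1 = g12 / g2       # leaf 1 link pass probability
    pass2 = g12 / g1       # leaf 2 link pass probability
    return 1 - pass0, 1 - pass1, 1 - pass2

probes = simulate_two_leaf(0.05, 0.10, 0.20, n_probes=200000, seed=1)
est0, est1, est2 = infer_two_leaf(probes)
```

The shared-link estimate relies on the correlation between the two leaves: a probe seen at both leaves must have crossed the shared link, so the frequency of joint reception separates loss on the shared link from loss on the individual leaf links.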

We evaluated the method by conducting ns simulations that
used topologies and traffic flows with quite a rich structure, with
several hops per flow and flows per link. We compared inferred
and actual loss probabilities on the links of the logical
multicast tree. The experiments showed that the loss probabilities for
probe packets were inferred extremely closely by the MLE.

The probe traffic was typically only 1% to 2% of the traffic
on each link. We investigated how closely loss rates for
background traffic were inferred. We examined the effect of
changing traffic mix, topology, packet discard method and probe type.
We found small differences between these, compared with the
inherent variability of the estimates. Varying the network
feedback delay also affected the accuracy of inference. For very
short propagation delays we believe that the aggressive
behavior of TCP slow start is a factor in decreasing accuracy. We
intend to investigate this phenomenon more fully.

Over a range of experiments our summary statistics show that
the relative error of the inferred and actual losses had a
distribution whose center was no greater than about 1.5 and whose 90th
percentile was no worse than a factor of about 2.2. If one is
limited to using inferred probe loss to estimate background traffic
loss, this would mean that only in 1% of the worst cases would
a single inference fail to distinguish between two background
loss rates that are separated by a factor of 5. We believe that this is
sufficiently accurate to identify the most congested links.

References 

[1]  J-C.  Bolot  and  A.  Vega  Garcia  “The  case  for  FEC-based  error  control  for 
packet  audio  in  the  Internet”  ACM  Multimedia  Systems,  to  appear. 

[2] R. Caceres, N.G. Duffield, J. Horowitz, D. Towsley, “Multicast-based
inference of network-internal characteristics”, Comp. Sci. Tech.
Rep. 98-17, University of Massachusetts at Amherst, February 1998.
ftp://gaia.cs.umass.edu/pub/CDHT98iMINC.ps.Z

[3]  R.  L.  Carter  and  M.  E.  Crovella,  “Measuring  Bottleneck  Link  Speed  in 
Packet-Switched  Networks,”  PERFORMANCE  '96,  October  1996. 

[4] A. Dembo and O. Zeitouni, “Large deviations techniques and
applications”, Jones and Bartlett, Boston, 1993.

[5] B. Efron and D.V. Hinkley, “Assessing the accuracy of the maximum
likelihood estimator: Observed versus expected Fisher information”,
Biometrika, 65, 457-487, 1978.

[6] Felix: Independent Monitoring for Network Survivability.
For more information see
ftp://ftp.bellcore.com/pub/mwg/felix/index.html

[7] S. Floyd and V. Jacobson, “Random Early Detection Gateways for
Congestion Avoidance,” IEEE/ACM Transactions on Networking, 1(4),
August 1993.

[8] IPMA: Internet Performance Measurement and Analysis. For more
information see http://www.merit.edu/ipma

[9] IP Performance Metrics Working Group. For more information see
http://www.ietf.org/html.charters/ippm-charter.html

[10]  V.  Jacobson,  “Congestion  Avoidance  and  Control”,  Proceedings  of  ACM 
SIGCOMM  '88,  August  1988,  pp.  314-329. 

[11] V. Jacobson, Pathchar - A Tool to Infer Characteristics of Internet paths.
For more information see ftp://ftp.ee.lbl.gov/pathchar

[12]  E.L.  Lehmann.  “Theory  of  point  estimation”.  Wiley-Interscience,  1983. 

[13] B.N. Levine, S. Paul, J.J. Garcia-Luna-Aceves, “Organizing multicast
receivers deterministically according to packet-loss correlation”. Preprint,
University of California, Santa Cruz.

[14] J. Mahdavi, V. Paxson, A. Adams, M. Mathis, “Creating a Scalable
Architecture for Internet Measurement,” to appear in Proc. INET '98.

[15]  M.  Mathis  and  J.  Mahdavi,  “Diagnosing  Internet  Congestion  with 
a  Transport  Layer  Performance  Tool,”  Proc.  INET  '96,  Montreal, 
June  1996. 

[16]  S.P.  Meyn  and  R.L.  Tweedie,  “Markov  chains  and  stochastic  stability”. 
Springer,  New  York,  1993. 

[17] mtrace - Print multicast path from a source
to a receiver. For more information see
ftp://ftp.parc.xerox.com/pub/net-research/ipmulti

[18] nam - Network Animator. For more information see
http://www-mash.cs.berkeley.edu/ns/nam.html

[19] ns - Network Simulator. For more information see
http://www-mash.cs.berkeley.edu/ns/ns.html

[20] V. Paxson, “End-to-End Routing Behavior in the Internet,” Proc.
SIGCOMM '96, Stanford, Aug. 1996.

[21]  V.  Paxson,  “Towards  a  Framework  for  Defining  Internet  Performance 
Metrics,”  Proc.  INET  '96,  Montreal,  1996. 

[22]  V.  Paxson,  “End-to-End  Internet  Packet  Dynamics,”  Proc.  SIGCOMM 
1997,  Cannes,  France,  139-152,  September  1997. 

[23]  V.  Paxson,  “Automated  Packet  Trace  Analysis  of  TCP  Implementations,” 
Proc.  SIGCOMM  1997,  Cannes,  France,  167-179,  September  1997. 

[24] V. Paxson, “Measurements and Analysis of End-to-End Internet
Dynamics,” Ph.D. Dissertation, University of California, Berkeley, April 1997.

[25]  J.  Postel,  “Transmission  Control  Protocol,”  RFC  793,  September  1981. 

[26] S. Ratnasamy & S. McCanne, “Inference of Multicast Routing Tree
Topologies and Bottleneck Bandwidths using End-to-end
Measurements”, Proceedings IEEE Infocom '99, New York, (1999).

[27]  K.  Ross  &  C.  Wright,  “Discrete  Mathematics”,  Prentice  Hall,  Englewood 
Cliffs,  NJ,  1985. 

[28]  W.  Rudin,  “Functional  Analysis”,  McGraw-Hill,  New  York,  1973. 

[29]  L.  Sachs,  “Applied  Statistics”,  Springer,  New  York,  1982. 

[30]  M.J.  Schervish,  “Theory  of  Statistics”,  Springer,  New  York,  1995. 

[31] Surveyor. For more information see
http://io.advanced.org/surveyor/

[32] K. Thompson, G.J. Miller and R. Wilder, “Wide-Area Internet
Traffic Patterns and Characteristics,” IEEE Network, 11(6),
November/December 1997.

[33] R.W. Wolff, “Poisson Arrivals See Time Averages”, Operations Research,
30: 223-231, 1982.

[34]  M.  Yajnik,  J.  Kurose,  D.  Towsley,  “Packet  Loss  Correlation  in  the  MBone 
Multicast  Network,”  Proc.  IEEE  Global  Internet,  Nov.  1996 




